Chris A Mattmann wrote:
Hi Andrzej,


* JavaScript parsing: currently we ignore JavaScript completely. There
are many sites where a lot of links (e.g. menus) are built dynamically.
Currently it's impossible to harvest such links. I already made some
tests with a full JavaScript interpreter (using Rhino), but it's too
slow for massive crawling. A "good enough" solution similar to the one
used in other crawlers is needed, namely to use a heuristic JS parser
:-) (that is, try to match possible URLs within the script text -
somewhat similar to the plain text link extractor).


If you need help with this particular part, I would love to help you. Please
drop me an email offline with directions and suggestions if you would like
me to help and coordinate on this.

That would be great! My regexp skills are somewhat deficient... ;-)
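
As a starting point for that heuristic, here is a minimal sketch in Java of matching URL-like string literals in script text without interpreting the script. The class name JsLinkHeuristic, both regular expressions, and the list of page extensions are assumptions made for illustration; none of this is existing Nutch code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: scans script text for quoted
// string literals that look like URLs instead of running the script.
class JsLinkHeuristic {

    // A quoted literal: opening quote, a body without quotes or whitespace,
    // then the same quote character again.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("([\"'])([^\"'\\s]+)\\1");

    // Assumed definition of "looks like a link": an absolute http(s) URL, or
    // a path ending in a common page extension with an optional query string.
    private static final Pattern URL_LIKE = Pattern.compile(
        "(?i)^(https?://\\S+|[\\w./-]+\\.(html?|php|jsp|asp)(\\?\\S*)?)$");

    // Returns the candidate links in the order they appear in the script.
    static List<String> extractLinks(String script) {
        List<String> links = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(script);
        while (m.find()) {
            String candidate = m.group(2);
            if (URL_LIKE.matcher(candidate).matches()) {
                links.add(candidate);
            }
        }
        return links;
    }
}
```

A heuristic like this will miss links assembled by string concatenation and may pick up strings that are not links, so the results would still need the usual URL normalization and filtering.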



* fetching modified content only: this is related to the interaction
between Fetcher and protocol plugins. A simple change to the Protocol
interface will allow all protocol plugins to decide, based on protocol
headers, whether the content has changed since the last fetch and
whether they need to fetch the new content. This will result in
tremendous bandwidth/disk/CPU savings.


I could totally help out with this. Same deal, if you would like my help on
this, please let me know.

Specifically, I meant changing Protocol.getContent(String url) to Protocol.getContent(FetchListEntry fle), which passes a lot of useful information on to the protocol plugins.
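
To illustrate what a protocol plugin could do with that extra information, the sketch below uses a last fetch time to prepare a conditional HTTP request. FetchEntrySketch is a hypothetical stand-in for FetchListEntry (its real fields are not shown in this thread) and lastFetchTime is an assumed field; setIfModifiedSince and HTTP_NOT_MODIFIED are standard java.net members.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical stand-in for FetchListEntry; the field names are assumptions.
class FetchEntrySketch {
    final String url;
    final long lastFetchTime; // milliseconds since the epoch, 0 if never fetched

    FetchEntrySketch(String url, long lastFetchTime) {
        this.url = url;
        this.lastFetchTime = lastFetchTime;
    }
}

class ConditionalFetchSketch {

    // Builds the request but does not send it: openConnection() only creates
    // the connection object, and no network traffic happens until connect()
    // or a response accessor such as getResponseCode() is called.
    static HttpURLConnection prepare(FetchEntrySketch entry) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(entry.url).toURL().openConnection();
        if (entry.lastFetchTime > 0) {
            // Sent as an If-Modified-Since header, so the server can reply
            // 304 without a body if nothing has changed.
            conn.setIfModifiedSince(entry.lastFetchTime);
        }
        return conn;
    }

    // True when the server reports that the content is unchanged.
    static boolean unchanged(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}
```

With only a URL string, the plugin has no last fetch time to send, which is why the argument type matters here.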


Then, the programming model needs to be changed from exception-driven to status-driven, using ProtocolStatus - please see my first round of patches for ParseStatus to get an idea. So we need either to add a ProtocolStatus property to Content or, better yet, to change the signature to read:

Protocol.java:

        ProtocolStatus getContent(FetchListEntry fle, Content container);
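
To make the status-driven model concrete, here is a rough sketch of how a caller might branch on a returned status instead of catching exceptions. Every type below is a simplified, hypothetical stand-in: the three status values, the Entry and Container fields, and the in-memory protocol are assumptions for illustration, not the actual ProtocolStatus, FetchListEntry, or Content classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical stand-ins for the classes named in this thread.
class StatusDrivenSketch {

    enum Status { SUCCESS, NOT_MODIFIED, FAILED }

    static class Entry {
        final String url;
        final long lastFetchTime; // assumed field, milliseconds since the epoch
        Entry(String url, long lastFetchTime) {
            this.url = url;
            this.lastFetchTime = lastFetchTime;
        }
    }

    // Filled in by the protocol only when the status is SUCCESS.
    static class Container {
        byte[] bytes;
    }

    interface Protocol {
        Status getContent(Entry entry, Container container);
    }

    // Toy protocol backed by maps, so the sketch runs without a network.
    static class InMemoryProtocol implements Protocol {
        private final Map<String, Long> modified = new HashMap<>();
        private final Map<String, byte[]> bodies = new HashMap<>();

        void put(String url, long lastModified, String body) {
            modified.put(url, lastModified);
            bodies.put(url, body.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public Status getContent(Entry entry, Container container) {
            Long lastModified = modified.get(entry.url);
            if (lastModified == null) {
                return Status.FAILED;       // unknown URL: reported, not thrown
            }
            if (lastModified <= entry.lastFetchTime) {
                return Status.NOT_MODIFIED; // nothing new since the last fetch
            }
            container.bytes = bodies.get(entry.url);
            return Status.SUCCESS;
        }
    }

    // The caller decides what to do from the status alone.
    static String handle(Protocol protocol, Entry entry) {
        Container container = new Container();
        switch (protocol.getContent(entry, container)) {
            case SUCCESS:
                return "stored " + container.bytes.length + " bytes";
            case NOT_MODIFIED:
                return "skipped, unchanged";
            default:
                return "failed";
        }
    }
}
```

In this shape, "not modified" is an ordinary outcome that the Fetcher can handle in the same branch structure as success, rather than an exceptional path.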


--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com


