Hi Andrzej,
* JavaScript parsing: currently we ignore JavaScript completely. Many sites build a lot of their links (e.g. menus) dynamically, and at the moment it's impossible to harvest such links. I already made some tests with a full JavaScript interpreter (using Rhino), but it's too slow for massive crawling. A "good enough" solution similar to the one used in other crawlers is needed, namely a heuristic JS parser :-) (that is, try to match possible URLs within the script text, somewhat like the plain text link extractor does).
If you need help with this particular part, I would love to help you. Please drop me an email offline with directions and suggestions if you would like me to help and coordinate on this.
That would be great! My regexp skills are somewhat deficient... ;-)
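To make the heuristic idea above a bit more concrete, here is a rough sketch of what such a matcher could look like. It only looks at quoted string literals in the script text and keeps those that resemble a URL. The class name JsLinkHeuristic, the method extract() and the list of page extensions are all made up for illustration; none of this is existing code, and the pattern would certainly need tuning against real pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the existing code base.
class JsLinkHeuristic {

    // A quoted string literal is treated as a link candidate if it either
    //  - starts with http://, https:// or "/", or
    //  - ends in one of a few common page extensions (an assumed, incomplete list).
    // The backreference \1 requires the closing quote to match the opening one.
    private static final Pattern CANDIDATE = Pattern.compile(
        "([\"'])((?:https?://|/)[^\"'\\s<>]+|[^\"'\\s<>]+\\.(?:html?|php|jsp|asp))\\1",
        Pattern.CASE_INSENSITIVE);

    // Returns the candidate URLs in order of first appearance, without duplicates.
    static List<String> extract(String script) {
        List<String> links = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(script);
        while (m.find()) {
            String url = m.group(2);
            if (!links.contains(url)) {
                links.add(url);
            }
        }
        return links;
    }
}
```

Relative candidates would still have to be resolved against the page's base URL, and URLs assembled by string concatenation in the script are missed entirely, which is the price of not running an interpreter.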
* fetching modified content only: this is related to the interaction between Fetcher and protocol plugins. A simple change in the Protocol interface will allow all protocol plugins to decide, based on protocol headers, whether the content has changed since the last fetch, and whether they need to fetch the new content. This will result in tremendous bandwidth/disk/CPU savings.
I could totally help out with this. Same deal, if you would like my help on this, please let me know.
Specifically, I meant to change Protocol.getContent(String url) to Protocol.getContent(FetchListEntry fle). This would pass on a lot of useful information to protocol plugins.
Then, the programming model needs to be changed from exception-driven to status-driven, using ProtocolStatus - please see my first round of patches for ParseStatus to get an idea. So, either we need to add a ProtocolStatus property to Content, or better yet to change the signature to read:
Protocol.java:
ProtocolStatus getContent(FetchListEntry fle, Content container);
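As a rough illustration of how a plugin might look under that signature, here is a minimal, self-contained sketch. The classes below are simplified stand-ins that I made up for the example: the real FetchListEntry and Content have more fields, the status codes and the lastFetchTime field are assumptions, and InMemoryProtocol is a toy plugin backed by a map rather than a real protocol implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in; assumes the entry carries the time of the last fetch.
class FetchListEntry {
    final String url;
    final long lastFetchTime; // millis since epoch, 0 if never fetched (assumption)

    FetchListEntry(String url, long lastFetchTime) {
        this.url = url;
        this.lastFetchTime = lastFetchTime;
    }
}

// Simplified stand-in for the container that the plugin fills in.
class Content {
    byte[] data;
}

// Hypothetical status object; the codes are invented for this sketch.
class ProtocolStatus {
    static final int SUCCESS = 1;
    static final int NOT_MODIFIED = 2;
    static final int FAILED = 3;

    final int code;
    final String message;

    ProtocolStatus(int code, String message) {
        this.code = code;
        this.message = message;
    }
}

interface Protocol {
    ProtocolStatus getContent(FetchListEntry fle, Content container);
}

// Toy plugin: serves pages from memory and reports outcomes as statuses, not exceptions.
class InMemoryProtocol implements Protocol {
    private final Map<String, Long> lastModified = new HashMap<>();
    private final Map<String, byte[]> bodies = new HashMap<>();

    void put(String url, long modifiedAt, byte[] body) {
        lastModified.put(url, modifiedAt);
        bodies.put(url, body);
    }

    public ProtocolStatus getContent(FetchListEntry fle, Content container) {
        Long modifiedAt = lastModified.get(fle.url);
        if (modifiedAt == null) {
            return new ProtocolStatus(ProtocolStatus.FAILED, "not found: " + fle.url);
        }
        // Skip the transfer when the page has not changed since the last fetch.
        if (fle.lastFetchTime > 0 && modifiedAt <= fle.lastFetchTime) {
            return new ProtocolStatus(ProtocolStatus.NOT_MODIFIED, "unchanged: " + fle.url);
        }
        container.data = bodies.get(fle.url);
        return new ProtocolStatus(ProtocolStatus.SUCCESS, "fetched: " + fle.url);
    }
}
```

The point of the sketch is only that the Fetcher can branch on the returned status code and leave the container untouched for unchanged pages, instead of catching exceptions.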
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
