Hi Andrzej,

> * JavaScript parsing: currently we ignore JavaScript completely. There
> are many sites where a lot of links (e.g. menus) are built dynamically.
> Currently it's impossible to harvest such links. I already made some
> tests with a full JavaScript interpreter (using Rhino), but it's too
> slow for massive crawling. A "good enough" solution similar to the one
> used in other crawlers is needed, namely to use a heuristic JS parser
> :-) (that is, try to match possible URLs within the script text -
> somewhat similar to the plain text link extractor).

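
To check that we mean the same thing by a heuristic parser, here is a rough sketch of the idea as I understand it: scan the script text for quoted string literals and keep the ones that look like URLs. The class name JsLinkHeuristics, the method extract, and the list of page extensions are all my own assumptions for illustration, not existing Nutch code, and the patterns would certainly need tuning against real sites.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): pulls URL-like string literals out of script text. */
final class JsLinkHeuristics {

    // A single- or double-quoted literal containing no whitespace and no nested quotes.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("([\"'])([^\"'\\s]+)\\1");

    // Assumed heuristic: absolute http(s) URLs, or paths ending in a common page extension,
    // optionally followed by a query string.
    private static final Pattern URL_LIKE =
        Pattern.compile("(?i)^(https?://\\S+|[\\w./-]+\\.(html?|php|jsp|asp|aspx)(\\?\\S*)?)$");

    /** Returns the distinct URL-like literals in the order they first appear. */
    static List<String> extract(String script) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = STRING_LITERAL.matcher(script);
        while (m.find()) {
            String candidate = m.group(2);
            if (URL_LIKE.matcher(candidate).matches()) {
                found.add(candidate);
            }
        }
        return new ArrayList<>(found);
    }
}
```

This never executes the script, so it will miss URLs that are assembled by string concatenation and will pick up some false positives. That is the trade-off against running a full interpreter such as Rhino.
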
If you need help with this particular part, I would love to help you.
Please drop me an email offline with directions and suggestions if you
would like me to help and coordinate on this.

> * fetching modified content only: this is related to the interaction
> between Fetcher and protocol plugins. A simple change in Protocol
> interface will allow all protocol plugins to decide, based on protocol
> headers, whether the content has been changed since the last fetching,
> and if they need to fetch the new content. This will result in
> tremendous bandwidth/disk/CPU savings.

I could totally help out with this. Same deal, if you would like my help
on this, please let me know.

Thanks,
Chris
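
P.S. To make the second item concrete, here is a minimal sketch of the header-based decision for the HTTP case, using only java.net. The class ConditionalFetch and its two methods are hypothetical names of mine and say nothing about what the changed Protocol interface should actually look like. It assumes the time of the last fetch is stored somewhere and is passed in as epoch milliseconds.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical sketch (not the Nutch Protocol interface): header-based "has it changed?" check. */
final class ConditionalFetch {

    /** Decides from response metadata alone whether the stored copy is still current. */
    static boolean isUnchanged(int status, long lastModified, long lastFetchTime) {
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return true; // server answered 304 Not Modified
        }
        // lastModified == 0 means the header was absent, so a change has to be assumed.
        return lastModified > 0 && lastModified <= lastFetchTime;
    }

    /** Returns true if the content should be fetched again. Expects an http(s) URL. */
    static boolean needsRefetch(URL url, long lastFetchTime) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");          // headers only, no body transfer
        conn.setIfModifiedSince(lastFetchTime); // sends the If-Modified-Since header
        try {
            return !isUnchanged(conn.getResponseCode(), conn.getLastModified(), lastFetchTime);
        } finally {
            conn.disconnect();
        }
    }
}
```

Other protocols would need their own equivalent (for example a file modification time), which is presumably why the decision belongs in the protocol plugins rather than in the Fetcher.
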
