Hi Andrzej,

> 
> * JavaScript parsing: currently we ignore JavaScript completely. There
> are many sites where a lot of links (e.g. menus) are built dynamically.
> Currently it's impossible to harvest such links. I already made some
> tests with a full JavaScript interpreter (using Rhino), but it's too
> slow for massive crawling. A "good enough" solution similar to the one
> used in other crawlers is needed, namely to use a heuristic JS parser
> :-) (that is, try to match possible URLs within the script text -
> somewhat similar to the plain text link extractor).

I would love to help with this particular part. If you would like me to
help and coordinate on it, please drop me an email off-list with directions
and suggestions.
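
To make the heuristic idea concrete, here is a rough sketch in Java of
matching possible URLs within the script text. The class name, the regex,
and the list of page extensions are all my own assumptions for illustration,
not taken from any existing crawler, and the pattern would certainly need
tuning against real sites:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans raw script text for string literals that look like links.
final class JsLinkHeuristic {
    // Matches a quoted literal that is either an absolute http(s) URL, a
    // root-relative path, or a name ending in a common page extension
    // (the extension list is an assumption, not exhaustive).
    private static final Pattern URL_LITERAL = Pattern.compile(
        "([\"'])((?:https?://|/)[^\"'\\s<>]+|[^\"'\\s<>]+\\.(?:html?|php|jsp|asp))\\1",
        Pattern.CASE_INSENSITIVE);

    // Returns candidate links in order of first appearance, without duplicates.
    static List<String> extract(String script) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_LITERAL.matcher(script);
        while (m.find()) {
            String candidate = m.group(2);
            if (!links.contains(candidate)) {
                links.add(candidate);
            }
        }
        return links;
    }
}
```

Calling JsLinkHeuristic.extract() on a menu-building script should return
the quoted paths and URLs and skip ordinary strings such as labels. It will
miss URLs assembled by string concatenation, which is the trade-off against
running a full interpreter.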

> * fetching modified content only: this is related to the interaction
> between Fetcher and protocol plugins. A simple change in Protocol
> interface will allow all protocol plugins to decide, based on protocol
> headers, whether the content has been changed since the last fetching,
> and if they need to fetch the new content. This will result in
> tremendous bandwidth/disk/CPU savings.

I could totally help out with this. Same deal, if you would like my help on
this, please let me know.
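
As a starting point for discussion, here is a rough sketch in Java of the
kind of check an HTTP plugin could make from the protocol headers. The class
and method names are hypothetical and are not part of the current Protocol
interface; I am also assuming the plugin has the status code, the response
headers, and the time of the last fetch available:

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Map;

// Hypothetical helper, not an existing API: decides whether content must be fetched again.
final class ModifiedCheck {
    // Returns true if the content should be (re)fetched, false if the headers
    // indicate it has not changed since lastFetchMillis (epoch milliseconds).
    // Assumes the header map uses the canonical "Last-Modified" spelling.
    static boolean needsFetch(int status, Map<String, String> headers, long lastFetchMillis) {
        if (status == 304) {
            return false; // server answered "Not Modified" to a conditional request
        }
        String lastModified = headers.get("Last-Modified");
        if (lastModified == null) {
            return true; // no information available: fetch to be safe
        }
        try {
            long modified = ZonedDateTime
                .parse(lastModified, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant()
                .toEpochMilli();
            return modified > lastFetchMillis;
        } catch (DateTimeParseException e) {
            return true; // unparseable date: fetch to be safe
        }
    }
}
```

Other protocols (file, ftp) would need their own notion of "modified", for
example a file timestamp, which is why leaving the decision to each plugin
seems like the right design.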

Thanks,
  Chris
