Hi, Andrzej,
Could you give us a brief on what you are going to change, so that we can weather your storm better ;-)? Thanks,
Sure ;-)
By "accuracy" I mean the ratio of total number of valid pages to the number of pages collected by a crawler (perhaps I should use the term "recall"?). Currently Nutch does here a decent, but not exceptionally good work. The goal of this assignment is to bring our crawler to the level where it can collect per site at least 90% of pages collected by Google for the same site.
There are several problems in our crawler which I want to address with these patches (I've appended rough sketches of some of these ideas below the list):
* cookie support: there have been a few patchsets floating around; I would like to select the best approach and add it. For some sites cookie support is absolutely required just to get past the front page; for others it can open up previously inaccessible areas.
* some sites make extensive use of the "meta" refresh and redirect directives, to make sure that you always visit a certain page first (which usually sets cookies or some such). Currently the interaction between the Fetcher, protocol plugins and parser plugins makes it nearly impossible to handle this. I already have a set of patches which preserve the layering, change the interaction to be status-driven instead of exception-driven, and allow following multiple redirects if necessary.
* JavaScript parsing: currently we ignore JavaScript completely. There are many sites where a lot of links (e.g. menus) are built dynamically, and at the moment it's impossible to harvest such links. I have already done some tests with a full JavaScript interpreter (using Rhino), but it's too slow for massive crawling. A "good enough" solution similar to the one used in other crawlers is needed, namely a heuristic JS parser :-) (that is, try to match possible URLs within the script text, somewhat like the plain-text link extractor).
* session support: for some sites there is session state that the Fetcher accumulates as it goes from one page to another, and which is effectively lost at the end of the run. This causes some pages to be missed or some areas to remain inaccessible. Related to this is accessing password-protected pages. I will investigate methods to save and pick up session state between consecutive Fetcher runs.
* fetching modified content only: this is related to the interaction between the Fetcher and protocol plugins. A simple change in the Protocol interface will allow all protocol plugins to decide, based on protocol headers, whether the content has changed since the last fetch and whether they need to fetch it again. This will result in tremendous bandwidth/disk/CPU savings.
* related to the above, an implementation of adaptive interval fetching. This has been discussed in the past, and I had a patchset that worked well almost a year ago, but I was too busy to keep it up to date with the changes since then. I believe the time has come to finalize this.
* any other issues that pop up on the way...?
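To make the cookie item a bit more concrete, here is a minimal sketch using the stock JDK cookie classes - purely an illustration (the real patch will of course live in the protocol plugins, and the class name and URLs below are made up):

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieAwareFetch {

  public static void main(String[] args) throws Exception {
    // Install a cookie jar: HttpURLConnection will then store and replay cookies automatically
    CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

    // First request: the front page typically sets a session cookie
    HttpURLConnection front = (HttpURLConnection) new URL("http://example.com/").openConnection();
    front.getResponseCode();
    front.disconnect();

    // Second request to the same site: the stored cookie is sent back,
    // so pages behind the front page become reachable
    HttpURLConnection inner = (HttpURLConnection) new URL("http://example.com/catalog").openConnection();
    System.out.println("Status with cookie: " + inner.getResponseCode());
    inner.disconnect();
  }
}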
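For the meta-refresh/redirect item, the status-driven (rather than exception-driven) idea can be sketched roughly like this; this is not the actual patch, just an illustration with made-up names (RedirectFollower, MAX_REDIRECTS):

import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectFollower {

  private static final int MAX_REDIRECTS = 5;   // assumed limit, would be configurable

  /** Fetch a URL, following redirects by inspecting status codes instead of catching exceptions. */
  public static HttpURLConnection fetch(String url) throws Exception {
    for (int hop = 0; hop < MAX_REDIRECTS; hop++) {
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      conn.setInstanceFollowRedirects(false);     // we want to see and record every hop ourselves
      int status = conn.getResponseCode();
      if (status >= 300 && status < 400) {        // HTTP redirect: resolve Location and loop
        String location = conn.getHeaderField("Location");
        conn.disconnect();
        if (location == null) break;
        url = new URL(new URL(url), location).toString();
        continue;
      }
      return conn;                                // success or terminal error: report the status upstream
    }
    throw new Exception("Too many redirects (or missing Location): " + url);
  }
}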
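For the JavaScript item, the heuristic parser could be as simple as a regular-expression scan over the script text; the pattern below is only a rough example and would need tuning:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JSLinkSniffer {

  // Matches quoted absolute URLs and quoted site-relative paths with a typical page extension
  private static final Pattern LINK =
      Pattern.compile("(?:\"|')(https?://[^\"'\\s]+|/[^\"'\\s]+\\.(?:html?|php|jsp|asp))(?:\"|')",
                      Pattern.CASE_INSENSITIVE);

  public static List<String> extractLinks(String scriptText) {
    List<String> links = new ArrayList<String>();
    Matcher m = LINK.matcher(scriptText);
    while (m.find()) {
      links.add(m.group(1));
    }
    return links;
  }

  public static void main(String[] args) {
    String js = "function go(){ window.location='http://example.com/page.html'; }"
              + " var menu = ['/docs/index.html', '/about.php'];";
    System.out.println(extractLinks(js));   // prints the three candidate links found in the script
  }
}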
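For the session support item, one very simplified way to save and pick up session state between Fetcher runs would be to persist the cookie store to disk; the file format below is just an example (expiry, path and secure flags are ignored):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.CookieStore;
import java.net.HttpCookie;
import java.net.URI;

public class SessionStateStore {

  /** Dump all cookies from the in-memory store to a plain-text file at the end of a run. */
  public static void save(CookieStore store, File file) throws IOException {
    PrintWriter out = new PrintWriter(new FileWriter(file));
    for (HttpCookie c : store.getCookies()) {
      out.println(c.getDomain() + "\t" + c.getName() + "\t" + c.getValue());
    }
    out.close();
  }

  /** Re-populate a fresh cookie store from the file written by save() at the start of the next run. */
  public static void load(CookieStore store, File file) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(file));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.split("\t");
      String domain = f[0].startsWith(".") ? f[0].substring(1) : f[0];
      HttpCookie cookie = new HttpCookie(f[1], f[2]);
      cookie.setDomain(f[0]);
      store.add(URI.create("http://" + domain), cookie);
    }
    in.close();
  }
}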
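For fetching modified content only, at the HTTP level this mostly boils down to sending If-Modified-Since and honouring a 304 response; a small sketch, assuming the last fetch time is available from the crawl db:

import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalFetch {

  /** Ask the server whether the page changed since lastFetchTime (milliseconds since epoch). */
  public static boolean modifiedSince(String url, long lastFetchTime) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setIfModifiedSince(lastFetchTime);              // adds an If-Modified-Since request header
    int status = conn.getResponseCode();
    boolean modified = (status != HttpURLConnection.HTTP_NOT_MODIFIED);
    // 304 means the server says nothing changed: skip download, parsing and indexing.
    // Otherwise the response body of this same request would be consumed by the fetcher.
    conn.disconnect();
    return modified;
  }
}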
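And for adaptive interval fetching, the core idea can be shown with a toy rule - shrink the interval when a page changes, grow it when it doesn't (the bounds below are arbitrary):

public class AdaptiveInterval {

  static final long MIN_INTERVAL = 60L * 60 * 1000;            // 1 hour, in milliseconds
  static final long MAX_INTERVAL = 30L * 24 * 60 * 60 * 1000;  // 30 days

  /** Compute the next re-fetch interval from the current one and whether the page changed. */
  public static long nextInterval(long currentInterval, boolean pageChanged) {
    long next = pageChanged
        ? currentInterval / 2        // the page changed: come back sooner
        : currentInterval * 2;       // unchanged: back off and save bandwidth
    return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, next));
  }
}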
I will work on these areas intensively during the next 2-3 weeks, and will release patches piece by piece so that we can discuss them. You are all more than welcome to join in and help!
-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com   Contact: info at sigram dot com
