These are all very important and useful features/patches. I have encountered
some of them in my apps and have some ad hoc fixes, but it'll be nice to
have more general implementations.

Can't wait to see your changes!

Thanks

Jay


-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 28, 2005 4:45 PM
To: [email protected]
Cc: [EMAIL PROTECTED]
Subject: Re: Upcoming work on Fetcher


John X wrote:
> Hi, Andrzej,
> 
> Could you give us a brief on what you are going to change,
> so that we can weather your storm better ;-)?
> Thanks,

Sure ;-)

By "accuracy" I mean the ratio of total number of valid pages to the 
number of pages collected by a crawler (perhaps I should use the term 
"recall"?). Currently Nutch does here a decent, but not exceptionally 
good work. The goal of this assignment is to bring our crawler to the 
level where it can collect per site at least 90% of pages collected by 
Google for the same site.

There are a couple of problems in our crawler that I want to address 
with these patches:

* cookie support: a few patchsets have been floating around; I would 
like to select the best approach and add it. For some sites cookie 
support is absolutely required to get past the front page; for others 
it can give access to previously inaccessible areas.
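
To give a feeling for the direction, here is a minimal sketch of a 
per-host cookie store; the class and method names are made up for 
illustration and are not taken from any of the existing patchsets:

  import java.util.HashMap;
  import java.util.Map;

  // Minimal sketch: remember "Set-Cookie" values per host and replay
  // them on later requests to the same host. Expiry, Path and Domain
  // attributes are ignored for brevity.
  public class SimpleCookieStore {
    private final Map<String, String> cookiesByHost =
        new HashMap<String, String>();

    // call with the value of a "Set-Cookie" response header
    public synchronized void store(String host, String setCookieHeader) {
      // keep only the "name=value" part, drop Path/Expires/etc.
      String nameValue = setCookieHeader.split(";", 2)[0].trim();
      String existing = cookiesByHost.get(host);
      cookiesByHost.put(host,
          existing == null ? nameValue : existing + "; " + nameValue);
    }

    // value for the "Cookie" request header, or null if none known
    public synchronized String cookieHeaderFor(String host) {
      return cookiesByHost.get(host);
    }
  }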

* some sites make extensive use of the "meta" refresh and redirect 
directives to make sure that you always visit a certain page first 
(which usually sets cookies or some such). Currently the interaction 
between the Fetcher, protocol plugins and parser plugins makes this 
nearly impossible to implement. I already have a set of patches which 
preserve the layering, change the interaction to status-driven instead 
of exception-driven, and allow following multiple redirects if 
necessary.
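
For illustration only, the status-driven idea boils down to a loop 
like the one below; this is a self-contained sketch using plain 
java.net, not the actual Fetcher code:

  import java.net.HttpURLConnection;
  import java.net.URL;

  // Sketch: follow HTTP redirects by inspecting status codes instead
  // of relying on exceptions; returns the final URL after at most
  // maxRedirects hops.
  public class RedirectFollower {
    public static String resolve(String url, int maxRedirects)
        throws Exception {
      for (int hop = 0; hop < maxRedirects; hop++) {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(url).openConnection();
        conn.setInstanceFollowRedirects(false); // handle redirects ourselves
        int code = conn.getResponseCode();
        if (code < 300 || code >= 400) {
          return url;                 // terminal status, we are done
        }
        String location = conn.getHeaderField("Location");
        if (location == null) {
          return url;                 // malformed redirect, give up
        }
        url = new URL(new URL(url), location).toString(); // resolve relative
      }
      return url;                     // redirect limit reached
    }
  }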

* JavaScript parsing: currently we ignore JavaScript completely. There 
are many sites where a lot of links (e.g. menus) are built dynamically, 
and at the moment it's impossible to harvest such links. I already made 
some tests with a full JavaScript interpreter (using Rhino), but it's 
too slow for massive crawling. A "good enough" solution similar to the 
one used in other crawlers is needed, namely a heuristic JS parser :-) 
(that is, one that tries to match possible URLs within the script text, 
somewhat similar to the plain text link extractor).
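
A rough illustration of the heuristic approach (the pattern below is 
made up and certainly incomplete):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Sketch of a heuristic JS "parser": instead of executing the
  // script, scan it for string literals that look like URLs or paths.
  public class JSLinkSniffer {
    // absolute URLs, or quoted strings that look like page paths
    private static final Pattern URL_PATTERN = Pattern.compile(
        "[\"'](https?://[^\"'\\s]+|[\\w./-]+\\.(?:html?|php|jsp|asp))[\"']",
        Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String script) {
      List<String> links = new ArrayList<String>();
      Matcher m = URL_PATTERN.matcher(script);
      while (m.find()) {
        links.add(m.group(1));
      }
      return links;
    }
  }

For example, extractLinks("document.location='menu/page1.html';") 
would yield "menu/page1.html", without ever running the script.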

* session support: for some sites, session state is accumulated by the 
Fetcher as it goes from one page to another, and is effectively lost at 
the end of the run. This causes some pages to be missed or some areas 
to become inaccessible. Related to this is accessing password-protected 
pages. I will investigate methods to save and restore session state 
between consecutive Fetcher runs.
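
Just to sketch one possible method (the properties-file approach and 
the idea of keying state by host are assumptions for illustration, not 
a decision):

  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.util.Properties;

  // Sketch: persist per-host session state (e.g. cookie headers)
  // between Fetcher runs in a plain properties file.
  public class SessionStore {
    private final Properties sessions = new Properties();

    public void load(String file) throws Exception {
      FileInputStream in = new FileInputStream(file);
      try { sessions.load(in); } finally { in.close(); }
    }

    public void save(String file) throws Exception {
      FileOutputStream out = new FileOutputStream(file);
      try { sessions.store(out, "per-host session state"); }
      finally { out.close(); }
    }

    public void put(String host, String state) {
      sessions.setProperty(host, state);
    }

    public String get(String host) {
      return sessions.getProperty(host);
    }
  }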

* fetching modified content only: this is related to the interaction 
between the Fetcher and protocol plugins. A simple change to the 
Protocol interface will allow all protocol plugins to decide, based on 
protocol headers, whether the content has changed since the last fetch 
and whether they need to fetch the new content at all. This will result 
in tremendous bandwidth/disk/CPU savings.
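
In plain java.net terms the idea for HTTP is a conditional GET; this 
is a sketch of the mechanism, not the actual Protocol interface 
change:

  import java.net.HttpURLConnection;
  import java.net.URL;

  // Sketch: ask the server whether the page changed since the last
  // fetch; a 304 response means we can skip refetching it.
  public class ConditionalFetch {
    public static boolean isModified(String url, long lastFetchTime)
        throws Exception {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(url).openConnection();
      conn.setIfModifiedSince(lastFetchTime); // sends If-Modified-Since
      int code = conn.getResponseCode();
      return code != HttpURLConnection.HTTP_NOT_MODIFIED; // 304 = unchanged
    }
  }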

* related to the above, an implementation of adaptive interval fetching. 
This has been discussed in the past, and I had a well-working patchset 
almost a year ago, but was too busy to keep it up to date with the 
changes. I believe the time has come to finalize this.
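
For illustration, the core of such an algorithm can be very simple; 
all the constants below are made up:

  // Sketch of adaptive re-fetch intervals: back off when a page is
  // unchanged, tighten when it has changed.
  public class AdaptiveInterval {
    private static final float INC_FACTOR = 1.5f;   // grow when unchanged
    private static final float DEC_FACTOR = 0.5f;   // shrink when changed
    private static final long MIN_INTERVAL = 3600L;           // 1 hour (s)
    private static final long MAX_INTERVAL = 30L * 24 * 3600; // 30 days

    public static long nextInterval(long current, boolean changed) {
      long next = changed ? (long) (current * DEC_FACTOR)
                          : (long) (current * INC_FACTOR);
      return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, next));
    }
  }

Pages that keep changing converge to the minimum interval, while 
static pages drift toward the maximum.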

* any other issues that pop up on the way...?

I will work on these areas intensively during the next 2-3 weeks, and 
will release patches piece by piece, so that we can discuss them. You 
are all more than welcome to join and help!

-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
