Hi Andrzej,

The improvements sound great.
Any plans to support form-based credentials?

/Jack 

On 4/29/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> John X wrote:
> > Hi, Andrzej,
> >
> > Could you give us a brief overview of what you are going to change,
> > so that we can weather your storm better ;-)?
> > Thanks,
> 
> Sure ;-)
> 
> By "accuracy" I mean the ratio of total number of valid pages to the
> number of pages collected by a crawler (perhaps I should use the term
> "recall"?). Currently Nutch does here a decent, but not exceptionally
> good work. The goal of this assignment is to bring our crawler to the
> level where it can collect per site at least 90% of pages collected by
> Google for the same site.
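>
> Expressed as a rough formula (my own shorthand, with Google's count
> standing in for the unknown total of valid pages):
>
>     per-site recall = (pages fetched by Nutch) / (valid pages on site)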
> 
> There are a number of problems in our crawler which I want to address
> with these patches:
> 
> * cookie support: there were a few patchsets floating around; I would
> like to select the best approach and add it. For some sites cookie
> support is absolutely required in order to get past the front page;
> for others it can give access to previously inaccessible areas.
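>
> Just to illustrate the idea (a bare-bones JDK sketch with made-up
> names, NOT code from any of the patchsets), per-host cookie handling
> boils down to capturing Set-Cookie response headers and replaying
> them on later requests to the same host:
>
>     import java.net.HttpURLConnection;
>     import java.net.URL;
>     import java.util.*;
>
>     public class CookieJar {
>       // host -> cookies ("name=value") seen in Set-Cookie headers
>       private final Map<String, List<String>> hostCookies =
>         new HashMap<String, List<String>>();
>
>       // call after a response has been received
>       public void store(URL url, HttpURLConnection conn) {
>         List<String> setCookies =
>           conn.getHeaderFields().get("Set-Cookie");
>         if (setCookies == null) return;
>         List<String> cookies = hostCookies.get(url.getHost());
>         if (cookies == null) {
>           cookies = new ArrayList<String>();
>           hostCookies.put(url.getHost(), cookies);
>         }
>         for (String c : setCookies) {
>           // keep only "name=value", drop attributes like Path/Expires
>           cookies.add(c.split(";", 2)[0]);
>         }
>       }
>
>       // call before connecting the next request to the same host
>       public void apply(URL url, HttpURLConnection conn) {
>         List<String> cookies = hostCookies.get(url.getHost());
>         if (cookies == null || cookies.isEmpty()) return;
>         StringBuilder buf = new StringBuilder();
>         for (Iterator<String> i = cookies.iterator(); i.hasNext();) {
>           buf.append(i.next());
>           if (i.hasNext()) buf.append("; ");
>         }
>         conn.setRequestProperty("Cookie", buf.toString());
>       }
>     }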
> 
> * some sites make extensive use of the "meta" refresh and redirect
> directives to make sure that you always visit a certain page first
> (which usually sets cookies or some such). Currently the interaction
> between the Fetcher, protocol plugins and parser plugins makes this
> nearly impossible to implement. I already have a set of patches which
> preserve the layering, change the interaction to status-driven instead
> of exception-driven, and allow following multiple redirects if
> necessary.
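>
> The actual patches touch Nutch internals, but the status-driven part
> can be shown with plain JDK classes (made-up names, not the real
> code): inspect the status and follow the Location header in a loop
> instead of bailing out with an exception on the first hop. Targets of
> "meta" refresh directives found by the parser would feed back into
> the same loop:
>
>     import java.net.HttpURLConnection;
>     import java.net.URL;
>
>     public class RedirectFollower {
>       private static final int MAX_REDIRECTS = 5;
>
>       // returns the connection for the final target, or null if we
>       // gave up (too many hops, or a redirect without a Location)
>       public static HttpURLConnection fetch(URL url) throws Exception {
>         for (int hop = 0; hop < MAX_REDIRECTS; hop++) {
>           HttpURLConnection conn =
>             (HttpURLConnection) url.openConnection();
>           conn.setInstanceFollowRedirects(false); // we handle the hops
>           int status = conn.getResponseCode();
>           if (status < 300 || status >= 400) {
>             return conn;                     // not a redirect: done
>           }
>           String location = conn.getHeaderField("Location");
>           if (location == null) return null; // broken redirect
>           url = new URL(url, location);      // resolve relative target
>           conn.disconnect();
>         }
>         return null;
>       }
>     }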
> 
> * JavaScript parsing: currently we ignore JavaScript completely. There
> are many sites where a lot of links (e.g. menus) are built dynamically,
> and currently it's impossible to harvest such links. I have already run
> some tests with a full JavaScript interpreter (using Rhino), but it's
> too slow for massive crawling. A "good enough" solution similar to the
> one used in other crawlers is needed, namely a heuristic JS parser :-)
> (that is, one that tries to match possible URLs within the script text,
> somewhat similar to the plain text link extractor).
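>
> A crude heuristic extractor (illustrative only) can be as simple as a
> regular expression over the script text, picking up quoted strings
> that look like URLs or absolute paths:
>
>     import java.util.*;
>     import java.util.regex.*;
>
>     public class JSLinkExtractor {
>       // quoted strings that start with http(s):// or with "/"
>       private static final Pattern URL_PATTERN =
>         Pattern.compile("[\"']((?:https?://|/)[^\"'\\s]+)[\"']");
>
>       public static List<String> extract(String script) {
>         List<String> links = new ArrayList<String>();
>         Matcher m = URL_PATTERN.matcher(script);
>         while (m.find()) {
>           links.add(m.group(1));
>         }
>         return links;
>       }
>
>       public static void main(String[] args) {
>         String js =
>           "function go() { window.location = '/docs/index.html'; }";
>         System.out.println(extract(js)); // [/docs/index.html]
>       }
>     }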
> 
> * session support: for some sites there is a session state that the
> Fetcher accumulates as it goes from one page to another, and which is
> effectively lost at the end of the run. This causes some pages to be
> missed or some areas to become inaccessible. Related to this is
> accessing password-protected pages. I will investigate methods to save
> session state and pick it up again between consecutive Fetcher runs.
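>
> One possible mechanism (plain Java serialization, illustrative names;
> what exactly goes into the state is still open) is to dump the
> per-host session data at the end of a run and load it back at the
> start of the next one:
>
>     import java.io.*;
>     import java.util.HashMap;
>
>     public class SessionStore {
>       // save per-host session data (e.g. cookies) at the end of a run
>       public static void save(File file,
>           HashMap<String, String> state) throws IOException {
>         ObjectOutputStream out =
>           new ObjectOutputStream(new FileOutputStream(file));
>         try { out.writeObject(state); } finally { out.close(); }
>       }
>
>       // pick the state up again at the start of the next run
>       @SuppressWarnings("unchecked")
>       public static HashMap<String, String> load(File file)
>           throws IOException, ClassNotFoundException {
>         if (!file.exists()) return new HashMap<String, String>();
>         ObjectInputStream in =
>           new ObjectInputStream(new FileInputStream(file));
>         try { return (HashMap<String, String>) in.readObject(); }
>         finally { in.close(); }
>       }
>     }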
> 
> * fetching modified content only: this is related to the interaction
> between the Fetcher and protocol plugins. A simple change in the
> Protocol interface will allow all protocol plugins to decide, based on
> protocol headers, whether the content has changed since the last fetch,
> and whether they need to fetch the new content. This will result in
> tremendous bandwidth/disk/CPU savings.
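>
> At the HTTP level this is just the standard If-Modified-Since
> mechanism; a JDK-level sketch of the check (made-up names, not the
> proposed Protocol change itself):
>
>     import java.net.HttpURLConnection;
>     import java.net.URL;
>
>     public class ConditionalFetch {
>       // lastFetched: time of the previous successful fetch (millis)
>       public static boolean isModified(URL url, long lastFetched)
>           throws Exception {
>         HttpURLConnection conn =
>           (HttpURLConnection) url.openConnection();
>         conn.setIfModifiedSince(lastFetched); // If-Modified-Since header
>         int status = conn.getResponseCode();
>         // a 304 answer carries no body, so nothing is downloaded,
>         // parsed or stored for unchanged pages
>         return status != HttpURLConnection.HTTP_NOT_MODIFIED;
>       }
>     }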
> 
> * related to the above, an implementation of adaptive fetch intervals.
> This has been discussed in the past, and I had a well-working patchset
> almost a year ago, but was too busy to keep it up to date with the
> changes. I believe the time has come to finalize this.
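>
> The core idea, stripped of all Nutch specifics (the constants and
> names below are made up for illustration), is to shrink the fetch
> interval for pages that turn out to change and grow it for pages
> that don't:
>
>     public class AdaptiveInterval {
>       private static final long MIN = 24L * 3600 * 1000;     // 1 day
>       private static final long MAX = 90 * MIN;              // 90 days
>
>       // called after each fetch with the outcome of the change check
>       public static long next(long interval, boolean modified) {
>         long next = modified ? interval / 2 : interval * 2;
>         return Math.max(MIN, Math.min(MAX, next));
>       }
>     }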
> 
> * any other issues that pop up along the way...?
> 
> I will work intensively on these areas during the next 2-3 weeks, and
> will release the patches piecewise so that we can discuss them. You
> are all more than welcome to join in and help!
> 
> --
> Best regards,
> Andrzej Bialecki
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
>
