Hi Andrzej,

The improvements sound great. Any plans to support form-based credentials
(i.e. logging in through HTML login forms)? I've appended a few rough
sketches of some of these ideas below the quote.

/Jack
On 4/29/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> John X wrote:
> > Hi, Andrzej,
> >
> > Could you give us a brief on what you are going to change,
> > so that we can weather your storm better ;-)? Thanks,
>
> Sure ;-)
>
> By "accuracy" I mean the ratio of the total number of valid pages to
> the number of pages collected by a crawler (perhaps I should use the
> term "recall"?). Currently Nutch does a decent, but not exceptionally
> good, job here. The goal of this assignment is to bring our crawler to
> the level where it can collect, per site, at least 90% of the pages
> collected by Google for the same site.
>
> There are a couple of problems in our crawler which I want to address
> with these patches:
>
> * cookie support: there were a few patchsets floating around; I would
> like to select the best approach and add it. For some sites cookie
> support is absolutely required in order to get past the front page;
> for others it can give access to previously inaccessible areas.
>
> * some sites make extensive use of the "meta" refresh and redirect
> directives to make sure that you always visit a certain page first
> (which usually sets cookies or some such). Currently the interaction
> between the Fetcher, protocol plugins and parser plugins is such that
> it is nearly impossible to implement this. I already have a set of
> patches which preserve the layering, change the interaction to
> status-driven instead of exception-driven, and allow following
> multiple redirects if necessary.
>
> * JavaScript parsing: currently we ignore JavaScript completely.
> There are many sites where a lot of links (e.g. menus) are built
> dynamically, and currently it's impossible to harvest such links. I
> have already run some tests with a full JavaScript interpreter (using
> Rhino), but it's too slow for massive crawling. A "good enough"
> solution similar to the one used in other crawlers is needed, namely
> a heuristic JS parser :-) (that is, one that tries to match possible
> URLs within the script text, somewhat similar to the plain-text link
> extractor).
>
> * session support: for some sites there is session state that is
> accumulated by the Fetcher as it goes from one page to another, and
> that is effectively lost at the end of the run. This causes some
> pages to be missed or some areas to become inaccessible. Related to
> this is accessing password-protected pages. I will investigate
> methods to save and pick up session state between consecutive Fetcher
> runs.
>
> * fetching modified content only: this is related to the interaction
> between the Fetcher and protocol plugins. A simple change in the
> Protocol interface will allow all protocol plugins to decide, based
> on protocol headers, whether the content has changed since the last
> fetch, and whether they need to fetch the new content. This will
> result in tremendous bandwidth/disk/CPU savings.
>
> * related to the above, an implementation of adaptive-interval
> fetching. This has been discussed in the past, and I had a
> well-working patchset almost a year ago, but was too busy to keep it
> up to date with the changes. I believe the time has come to finalize
> this.
>
> * ... and any other issues that pop up along the way?
>
> I will work on these areas intensively during the next 2-3 weeks, and
> will release the patches piecewise, so that we can discuss them. You
> are all more than welcome to join in and help!
>
> --
> Best regards,
> Andrzej Bialecki
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
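On the cookie-support item: purely as a strawman for discussion, a
bare-bones per-host cookie jar could look like the sketch below. The
class and method names are made up, and there is no expiry, path, or
domain-matching logic; it only shows how session cookies could be
carried from one fetch to the next within a run.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not an actual Nutch class: remembers cookies
// per host so later fetches can send them back.
public class CookieJar {
  private final Map<String, Map<String, String>> byHost =
      new HashMap<String, Map<String, String>>();

  // Record a Set-Cookie header value, e.g. "SID=abc123; Path=/".
  public void store(String host, String setCookieHeader) {
    String pair = setCookieHeader.split(";", 2)[0]; // drop attributes
    int eq = pair.indexOf('=');
    if (eq <= 0) return;
    Map<String, String> jar = byHost.get(host);
    if (jar == null) {
      jar = new HashMap<String, String>();
      byHost.put(host, jar);
    }
    jar.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
  }

  // Build the Cookie request header for the next fetch from this host,
  // or null if we have nothing stored.
  public String cookieHeader(String host) {
    Map<String, String> jar = byHost.get(host);
    if (jar == null || jar.isEmpty()) return null;
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : jar.entrySet()) {
      if (sb.length() > 0) sb.append("; ");
      sb.append(e.getKey()).append('=').append(e.getValue());
    }
    return sb.toString();
  }
}

Persisting this map to disk between Fetcher runs would be one cheap way
to approach the session-support item as well.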
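For the meta refresh handling, the parsing half is fairly mechanical.
Something like this (the regex is illustrative, not bulletproof) pulls
the redirect target out of the page so the Fetcher can follow it as a
status rather than an exception:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract the target URL from a meta refresh directive.
public class MetaRefresh {

  // Matches e.g. <meta http-equiv="refresh" content="0;url=http://...">
  private static final Pattern REFRESH = Pattern.compile(
      "<meta[^>]+http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*" +
      "content\\s*=\\s*[\"']?\\s*\\d+\\s*;\\s*url\\s*=\\s*([^\"'>\\s]+)",
      Pattern.CASE_INSENSITIVE);

  // Returns the refresh target, or null if the page has none.
  public static String refreshTarget(String html) {
    Matcher m = REFRESH.matcher(html);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    String html =
        "<meta http-equiv=\"refresh\" content=\"0;url=http://example.com/\">";
    System.out.println(refreshTarget(html)); // http://example.com/
  }
}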
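The heuristic JS parser sounds right to me. Here is the sort of thing I
pictured while reading that item: no Rhino, just a regex scan over the
script text for string literals that look like absolute URLs or
site-relative paths (class name and pattern are mine, for illustration
only):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a heuristic JS link extractor: instead of
// executing the script, scan it for URL-looking string literals,
// much like the plain-text link extractor does.
public class JSLinkSniffer {

  // Quoted absolute URLs, or quoted paths ending in a typical page
  // extension.
  private static final Pattern URL_PATTERN = Pattern.compile(
      "(?:\"|')(https?://[^\"'\\s]+|/[^\"'\\s]+\\.(?:html?|php|jsp|asp))(?:\"|')",
      Pattern.CASE_INSENSITIVE);

  public static List<String> extractLinks(String scriptText) {
    List<String> links = new ArrayList<String>();
    Matcher m = URL_PATTERN.matcher(scriptText);
    while (m.find()) {
      links.add(m.group(1));
    }
    return links;
  }

  public static void main(String[] args) {
    String js =
        "function go() { location.href = 'http://example.com/menu.html'; }";
    System.out.println(extractLinks(js)); // [http://example.com/menu.html]
  }
}

It will produce some false positives, but for crawling that is usually
an acceptable trade against missing whole dynamically-built menus.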
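On fetching modified content only: with plain java.net the
If-Modified-Since dance is cheap to prototype, since a 304 response
means the body never crosses the wire. This is only a sketch of the
idea, not the Protocol interface change you describe:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalFetch {

  // Returns null when the server reports 304 Not Modified, i.e. the
  // copy we stored at lastFetchTime (ms since epoch) is still current.
  public static byte[] fetchIfModified(String url, long lastFetchTime)
      throws Exception {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(url).openConnection();
    conn.setIfModifiedSince(lastFetchTime); // sends If-Modified-Since
    if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
      return null; // unchanged: no body transferred
    }
    InputStream in = conn.getInputStream();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    for (int n; (n = in.read(buf)) != -1; ) {
      out.write(buf, 0, n);
    }
    in.close();
    return out.toByteArray();
  }
}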
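And for adaptive-interval fetching, I assume the scheme is the usual
one: shorten the interval when a page turns out to have changed,
lengthen it when it hasn't. A toy version, with invented constants and
names, just to make the discussion concrete:

// Hypothetical sketch: adjust a page's refetch interval from the
// outcome of the last fetch. Multiplicative decrease on change,
// gentler increase otherwise, clamped to sane bounds.
public class AdaptiveInterval {

  static final long MIN_INTERVAL = 60L * 60 * 1000;            // 1 hour
  static final long MAX_INTERVAL = 30L * 24 * 60 * 60 * 1000;  // 30 days

  // Returns the next fetch interval in milliseconds.
  public static long nextInterval(long currentInterval, boolean changed) {
    long next = changed ? currentInterval / 2
                        : currentInterval + currentInterval / 4;
    return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, next));
  }
}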
