It would seem that an MD5 hash would still require you to actually fetch all the remote content, but I'll look at the patch; perhaps it can give me some ideas on using the Last-Modified date and the content size.
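
To make the header idea concrete, here is a rough sketch of the check I have in mind, written in Python rather than against Nutch's own code. It assumes the crawler has stored the Last-Modified and Content-Length values from the previous fetch, and that a HEAD request has already returned the current headers. `PageRecord` and `can_skip_fetch` are hypothetical names for illustration, not part of Nutch or of the NUTCH-61 patch.

```python
from dataclasses import dataclass
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical record of what the previous crawl saw for one URL."""
    last_modified: Optional[str]   # raw Last-Modified header value
    content_length: Optional[int]  # body size in bytes


def _parse_http_date(value: Optional[str]):
    """Parse an HTTP date header; return None if missing or malformed."""
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


def can_skip_fetch(previous: PageRecord, head_headers: Mapping[str, str]) -> bool:
    """Return True only when both the date and the size are known and unchanged.

    Any missing or unparseable value means "fetch again", so the check
    errs on the side of downloading the page.
    """
    # Header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in head_headers.items()}

    old_date = _parse_http_date(previous.last_modified)
    new_date = _parse_http_date(headers.get("last-modified"))
    if old_date is None or new_date is None:
        return False

    try:
        new_length = int(headers["content-length"])
    except (KeyError, ValueError):
        return False
    if previous.content_length is None:
        return False

    return old_date == new_date and new_length == previous.content_length
```

Servers that omit either header, or that generate pages dynamically, would always fall through to a full fetch, so this could only ever be an optimisation on top of a content check such as the MD5 comparison.
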
On 10/21/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> There is a Nutch patch 61:
> http://issues.apache.org/jira/browse/NUTCH-61
>
> It detects unmodified content of a target page by looking at its
> content MD5 hash value; somehow, it is not merged to the branch yet.
> I implemented patch 61 for my local development, but have done no
> further testing yet.
>
> For the refetching, you only have to generate a new fetchlist, not a
> new db.
>
> Michael Ji
>
> --- Lokkju <[EMAIL PROTECTED]> wrote:
>
> > Well, I guess I am looking at a few things -
> >
> > Running nightly, as I said
> > Using the last-modified-date header returned by the server to
> > determine if I even want to download the whole file - if the
> > last-modified date has not changed, and the file size is the same,
> > then I can probably skip it.
> >
> > Of course, this presupposes that I am only updating a database - it
> > seems sort of ridiculous that currently, the only easy method of
> > recrawling a site is to create a new db.
> >
> > On 10/21/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> > > I guess you can run segmentMergeTool to merge new segments with
> > > the previous ones (documents with a duplicated URL and content
> > > MD5 will be discarded) and then run the index on it.
> > >
> > > Not sure if it is the best scenario for daily refetching; just my
> > > thought based on the code I dug out.
> > >
> > > Michael Ji
> > >
> > > --- Lokkju <[EMAIL PROTECTED]> wrote:
> > >
> > > > I have searched through the mail archives, and seen this
> > > > question asked a lot, but no answer ever seems to come back.
> > > > I am going to be using Nutch against 5 sites, and I want to
> > > > update the index on a nightly basis. Besides deleting the
> > > > previous crawl, then running it again, what method of doing
> > > > nightly updates is recommended?
> > > >
> > > > Thanks,
> > > > Nick
