There is a Nutch patch 61 (http://issues.apache.org/jira/browse/NUTCH-61) that detects unmodified content on a target page by checking its content MD5 hash value; for some reason it has not been merged into the branch yet. I implemented patch 61 for my local development, but have done no further testing yet. For the refetching, you only have to generate a new fetchlist, not a new db.

Michael Ji,
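For illustration only, here is a rough Java sketch of the content-MD5 idea behind that patch; this is not the actual NUTCH-61 code, and the SignatureStore class and its method names are made up for the example:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: remember the MD5 of each page's content and
// skip re-processing when the hash is unchanged on the next fetch.
public class SignatureStore {

    private final Map<String, String> urlToMd5 = new HashMap<String, String>();

    // Compute the MD5 hex digest of the raw page content.
    static String md5Hex(byte[] content) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(content)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Returns true if the newly fetched content hashes to the same value
    // seen last time; otherwise records the new hash and returns false.
    public boolean isUnmodified(String url, byte[] newContent) throws NoSuchAlgorithmException {
        String newHash = md5Hex(newContent);
        String oldHash = urlToMd5.get(url);
        if (newHash.equals(oldHash)) {
            return true;   // content unchanged, no need to re-index
        }
        urlToMd5.put(url, newHash);
        return false;      // changed (or first fetch), process as usual
    }
}

The point is simply that when the stored hash matches the hash of the freshly fetched content, the page can be treated as unmodified and skipped.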
--- Lokkju <[EMAIL PROTECTED]> wrote:

> Well, I guess I am looking at a few things -
>
> Running nightly, as I said.
> Using the last-modified-date header returned by the server to
> determine if I even want to download the whole file - if the last
> modified date has not changed, and the file size is the same, then I
> can probably skip it.
>
> Of course, this pre-supposes that I am only updating a database - it
> seems sort of ridiculous that currently, the only easy method of
> recrawling a site is to create a new db.
>
> On 10/21/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> > I guess you can run SegmentMergeTool to merge new
> > segments with the previous one (documents with duplicated
> > URL and content MD5 will be discarded) and then run
> > index on it.
> >
> > Not sure if it is the best scenario for daily
> > refetching - just my thought based on the code I dug out.
> >
> > Michael Ji,
> >
> > --- Lokkju <[EMAIL PROTECTED]> wrote:
> >
> > > I have searched through the mail archives, and seen
> > > this question asked a lot, but no answer ever seems to come back.
> > > I am going to be using nutch against 5 sites, and I want to
> > > update the index on a nightly basis. Besides deleting the
> > > previous crawl, then running it again, what method of doing
> > > nightly updates is recommended?
> > >
> > > Thanks,
> > > Nick
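As a side note on the last-modified-date approach Lokkju describes above, here is a minimal sketch using plain java.net.HttpURLConnection and an If-Modified-Since header; if the server answers 304 Not Modified, the body is never downloaded. The needsRefetch helper and the example URL are made up for illustration, not Nutch code:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the "skip unchanged pages" idea from the thread: send a
// conditional GET using the Last-Modified value remembered from the
// previous crawl.
public class ConditionalFetch {

    // Returns true if the page should be refetched, false if the server
    // reports it has not changed since lastModifiedMillis.
    public static boolean needsRefetch(String url, long lastModifiedMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setIfModifiedSince(lastModifiedMillis);  // sends If-Modified-Since
        int status = conn.getResponseCode();
        conn.disconnect();
        // 304 means the content is unchanged, so the fetch can be skipped.
        return status != HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    public static void main(String[] args) throws IOException {
        long lastCrawl = System.currentTimeMillis() - 24L * 60 * 60 * 1000;  // e.g. yesterday
        boolean refetch = needsRefetch("http://example.com/", lastCrawl);
        System.out.println(refetch ? "page changed, refetch it" : "unchanged, skip it");
    }
}

A HEAD request comparing Content-Length against the stored file size would complement this, but not every server reports Last-Modified or Content-Length reliably, which is why the content-MD5 check is the more robust fallback.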
