Well, I guess I am looking at a few things: running nightly, as I said, and using the Last-Modified header returned by the server to decide whether I even want to download the whole file. If the last modified date has not changed and the file size is the same, then I can probably skip it.
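For what it's worth, the check I have in mind looks roughly like the sketch below. This is standalone java.net code, not Nutch's protocol plugin; needsRefetch() and the stored lastModified/lastSize values are names I made up for illustration.

  import java.net.HttpURLConnection;
  import java.net.URL;

  public class RecrawlCheck {

      // lastModified/lastSize are the values I would have recorded
      // during the previous crawl (hypothetical bookkeeping, not Nutch API).
      static boolean needsRefetch(String pageUrl, long lastModified,
                                  long lastSize) throws Exception {
          HttpURLConnection conn =
              (HttpURLConnection) new URL(pageUrl).openConnection();
          conn.setRequestMethod("HEAD");           // headers only, no body
          conn.setIfModifiedSince(lastModified);   // sends If-Modified-Since

          if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
              return false;                        // 304: server says unchanged
          }

          long modified = conn.getLastModified();  // 0 if the header is missing
          long size = conn.getContentLength();     // -1 if the header is missing
          // Only skip when both signals are present and match the last crawl.
          if (modified != 0 && modified == lastModified
                  && size >= 0 && size == lastSize) {
              return false;
          }
          return true;                             // refetch to be safe
      }
  }

A HEAD request with If-Modified-Since should let the server short-circuit with a 304 before I ever pull the body, and the Content-Length comparison covers servers that don't honor conditional requests.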
Of course, this presupposes that I am only updating a database - it seems sort of ridiculous that, currently, the only easy method of recrawling a site is to create a new db. As for the segmentMergeTool approach Michael suggests below, my reading of the dedup step is sketched after the quote.

On 10/21/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> I guess you can run segmentMergeTool to merge new
> segments with the previous one (documents with a
> duplicated URL and content MD5 will be discarded) and
> then run index on it.
>
> Not sure if it is the best scenario for daily
> refetching - just my thought based on the code I dug
> through.
>
> Michael Ji
>
> --- Lokkju <[EMAIL PROTECTED]> wrote:
>
> > I have searched through the mail archives and seen
> > this question asked a lot, but no answer ever seems
> > to come back. I am going to be using nutch against
> > 5 sites, and I want to update the index on a nightly
> > basis. Besides deleting the previous crawl, then
> > running it again, what method of doing nightly
> > updates is recommended?
> >
> > Thanks,
> > Nick
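As I read Michael's suggestion, the dedup step during the merge boils down to something like the logic below. This is just my own sketch of the idea, not actual SegmentMergeTool code; the keep() method and its names are hypothetical.

  import java.math.BigInteger;
  import java.security.MessageDigest;
  import java.util.HashSet;
  import java.util.Set;

  public class DedupSketch {
      private final Set<String> seenUrls = new HashSet<String>();
      private final Set<String> seenDigests = new HashSet<String>();

      // Walk the merged segments newest-first; the first document wins,
      // and any later one with the same URL or content MD5 is dropped.
      boolean keep(String url, byte[] content) throws Exception {
          MessageDigest md5 = MessageDigest.getInstance("MD5");
          String digest = new BigInteger(1, md5.digest(content)).toString(16);
          boolean newUrl = seenUrls.add(url);
          boolean newDigest = seenDigests.add(digest);
          return newUrl && newDigest;
      }
  }

If that is right, then merging each night's segments into the old ones and reindexing would keep only the freshest copy of each page, which is most of what I want - it is just a lot of machinery compared to updating in place.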
