Well, I guess I am looking at a few things: running nightly, as I said, and using the Last-Modified header returned by the server to decide whether I even want to download the whole file. If the last modified date has not changed and the file size is the same, then I can probably skip it.
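For what it's worth, the check I have in mind looks roughly like the sketch below. This is standalone java.net code, not Nutch's protocol plugin; needsRefetch() and the stored lastModified/lastSize values are names I made up for illustration.

  import java.net.HttpURLConnection;
  import java.net.URL;

  public class RecrawlCheck {

      // lastModified/lastSize are the values I would have recorded
      // during the previous crawl (hypothetical bookkeeping, not Nutch API).
      static boolean needsRefetch(String pageUrl, long lastModified,
                                  long lastSize) throws Exception {
          HttpURLConnection conn =
              (HttpURLConnection) new URL(pageUrl).openConnection();
          conn.setRequestMethod("HEAD");           // headers only, no body
          conn.setIfModifiedSince(lastModified);   // sends If-Modified-Since

          if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
              return false;                        // 304: server says unchanged
          }

          long modified = conn.getLastModified();  // 0 if the header is missing
          long size = conn.getContentLength();     // -1 if the header is missing
          // Only skip when both signals are present and match the last crawl.
          if (modified != 0 && modified == lastModified
                  && size >= 0 && size == lastSize) {
              return false;
          }
          return true;                             // refetch to be safe
      }
  }

A HEAD request with If-Modified-Since should let the server short-circuit with a 304 before I ever pull the body, and the Content-Length comparison covers servers that don't honor conditional requests.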
Of course, this presupposes that I am only updating a database - it seems sort of ridiculous that, currently, the only easy method of recrawling a site is to create a new db. As for the segmentMergeTool approach Michael suggests below, my reading of the dedup step is sketched after the quote.

On 10/21/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> I guess you can run segmentMergeTool to merge new
> segments with the previous one (documents with a
> duplicated URL and content MD5 will be discarded) and
> then run index on it.
>
> Not sure if it is the best scenario for daily
> refetching - just my thought based on the code I dug
> through.
>
> Michael Ji
>
> --- Lokkju <[EMAIL PROTECTED]> wrote:
>
> > I have searched through the mail archives and seen
> > this question asked a lot, but no answer ever seems
> > to come back. I am going to be using nutch against
> > 5 sites, and I want to update the index on a nightly
> > basis. Besides deleting the previous crawl, then
> > running it again, what method of doing nightly
> > updates is recommended?
> >
> > Thanks,
> > Nick
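As I read Michael's suggestion, the dedup step during the merge boils down to something like the logic below. This is just my own sketch of the idea, not actual SegmentMergeTool code; the keep() method and its names are hypothetical.

  import java.math.BigInteger;
  import java.security.MessageDigest;
  import java.util.HashSet;
  import java.util.Set;

  public class DedupSketch {
      private final Set<String> seenUrls = new HashSet<String>();
      private final Set<String> seenDigests = new HashSet<String>();

      // Walk the merged segments newest-first; the first document wins,
      // and any later one with the same URL or content MD5 is dropped.
      boolean keep(String url, byte[] content) throws Exception {
          MessageDigest md5 = MessageDigest.getInstance("MD5");
          String digest = new BigInteger(1, md5.digest(content)).toString(16);
          boolean newUrl = seenUrls.add(url);
          boolean newDigest = seenDigests.add(digest);
          return newUrl && newDigest;
      }
  }

If that is right, then merging each night's segments into the old ones and reindexing would keep only the freshest copy of each page, which is most of what I want - it is just a lot of machinery compared to updating in place.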
