It would seem that an MD5 hash would still require you to actually
fetch all the remote content, but I'll look at the patch; perhaps it
can give me some ideas on using the Last-Modified date and the
content size.
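
To make the difference concrete, here is a rough sketch (not taken from
the NUTCH-61 patch, which I have not read yet) of the two checks. The
names PageRecord, headers_look_unchanged and body_is_unchanged are made
up for illustration, and the sketch assumes the previous crawl stored
the Last-Modified value, the Content-Length and an MD5 of the body:

```python
import hashlib
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical record of what the previous crawl saw for one URL."""
    last_modified: Optional[str]   # raw Last-Modified header value
    content_length: Optional[int]  # Content-Length in bytes
    md5: Optional[str] = None      # hex MD5 of the previously fetched body


def headers_look_unchanged(record: PageRecord, headers: Mapping[str, str]) -> bool:
    """Cheap check: needs only response headers, not the body.

    Returns False whenever either header is missing on either side,
    so a page is refetched when we cannot tell.
    """
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    last_modified = lowered.get("last-modified")
    length = lowered.get("content-length")
    if last_modified is None or length is None or not length.isdigit():
        return False
    if record.last_modified is None or record.content_length is None:
        return False
    return (last_modified == record.last_modified
            and int(length) == record.content_length)


def body_is_unchanged(record: PageRecord, body: bytes) -> bool:
    """Expensive check: the whole body must already have been downloaded."""
    return record.md5 is not None and hashlib.md5(body).hexdigest() == record.md5
```

The header check could be fed from a HEAD request, so only pages that
fail it would need a full download; the MD5 check can only run after
the download has happened. Servers that omit or misreport these headers
would defeat the cheap check, so treat it as a heuristic.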

On 10/21/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> there is a nutch patch 61
> http://issues.apache.org/jira/browse/NUTCH-61
>
> to detect the unmodified content of a target page by
> looking at its content MD5 hash value; somehow, it
> is not merged to branch yet; I implemented patch 61
> for my local development, but no further testing yet;
>
> for the refetching, you only have to generate a new
> fetchlist---not a new db;
>
> Michael Ji,
>
> --- Lokkju <[EMAIL PROTECTED]> wrote:
>
> > Well, I guess I am looking at a few things -
> >
> > Running nightly, as I said
> > Using the last-modified-date header returned by the server to
> > determine if I even want to download the whole file - if the last
> > modified date has not changed, and the file size is the same, then I
> > can probably skip it.
> >
> > Of course, this pre-supposes that I am only updating a database - it
> > seems sort of ridiculous that currently, the only easy method of
> > recrawling a site is to create a new db.
> >
> > On 10/21/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> > > > I guess you can run segmentMergeTool to merge new
> > > > segments with previous ones (documents with duplicated
> > > > URL and content MD5 will be discarded) and then run
> > > > index on it,
> > > >
> > > > not sure if it is the best scenario for daily
> > > > refetching---just my thought based on the code I dug
> > > > out,
> > >
> > > Michael Ji,
> > >
> > > --- Lokkju <[EMAIL PROTECTED]> wrote:
> > >
> > > > > I have searched through the mail archives, and seen this question
> > > > > asked a lot, but no answer ever seems to come back.  I am going to be
> > > > > using nutch against 5 sites, and I want to update the index on a
> > > > > nightly basis.  Besides deleting the previous crawl, then running it
> > > > > again, what method of doing nightly updates is recommended?
> > > >
> > > > Thanks,
> > > > Nick
> > > >
> > >
> > >
> > >
> > >
> > >
> > > __________________________________
> > > Yahoo! Mail - PC Magazine Editors' Choice 2005
> > > http://mail.yahoo.com
> > >
> >
>
>
>
>
> __________________________________
> Yahoo! FareChase: Search multiple travel sites in one click.
> http://farechase.yahoo.com
>
