There is a Nutch patch 61 (http://issues.apache.org/jira/browse/NUTCH-61) that detects unmodified content on a target page by checking its content MD5 hash value; for some reason it has not been merged into the branch yet. I implemented patch 61 for my local development, but have done no further testing yet. For the refetching, you only have to generate a new fetchlist, not a new db.

Michael Ji,
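For illustration only, here is a rough Java sketch of the content-MD5 idea behind that patch; this is not the actual NUTCH-61 code, and the SignatureStore class and its method names are made up for the example:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: remember the MD5 of each page's content and
// skip re-processing when the hash is unchanged on the next fetch.
public class SignatureStore {

    private final Map<String, String> urlToMd5 = new HashMap<String, String>();

    // Compute the MD5 hex digest of the raw page content.
    static String md5Hex(byte[] content) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(content)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Returns true if the newly fetched content hashes to the same value
    // seen last time; otherwise records the new hash and returns false.
    public boolean isUnmodified(String url, byte[] newContent) throws NoSuchAlgorithmException {
        String newHash = md5Hex(newContent);
        String oldHash = urlToMd5.get(url);
        if (newHash.equals(oldHash)) {
            return true;   // content unchanged, no need to re-index
        }
        urlToMd5.put(url, newHash);
        return false;      // changed (or first fetch), process as usual
    }
}

The point is simply that when the stored hash matches the hash of the freshly fetched content, the page can be treated as unmodified and skipped.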
--- Lokkju <[EMAIL PROTECTED]> wrote:

> Well, I guess I am looking at a few things -
>
> Running nightly, as I said.
> Using the last-modified-date header returned by the server to
> determine if I even want to download the whole file - if the last
> modified date has not changed, and the file size is the same, then I
> can probably skip it.
>
> Of course, this pre-supposes that I am only updating a database - it
> seems sort of ridiculous that currently, the only easy method of
> recrawling a site is to create a new db.
>
> On 10/21/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> > I guess you can run SegmentMergeTool to merge new
> > segments with the previous one (documents with duplicated
> > URL and content MD5 will be discarded) and then run
> > index on it.
> >
> > Not sure if it is the best scenario for daily
> > refetching - just my thought based on the code I dug out.
> >
> > Michael Ji,
> >
> > --- Lokkju <[EMAIL PROTECTED]> wrote:
> >
> > > I have searched through the mail archives, and seen
> > > this question asked a lot, but no answer ever seems to come back.
> > > I am going to be using nutch against 5 sites, and I want to
> > > update the index on a nightly basis. Besides deleting the
> > > previous crawl, then running it again, what method of doing
> > > nightly updates is recommended?
> > >
> > > Thanks,
> > > Nick
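As a side note on the last-modified-date approach Lokkju describes above, here is a minimal sketch using plain java.net.HttpURLConnection and an If-Modified-Since header; if the server answers 304 Not Modified, the body is never downloaded. The needsRefetch helper and the example URL are made up for illustration, not Nutch code:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the "skip unchanged pages" idea from the thread: send a
// conditional GET using the Last-Modified value remembered from the
// previous crawl.
public class ConditionalFetch {

    // Returns true if the page should be refetched, false if the server
    // reports it has not changed since lastModifiedMillis.
    public static boolean needsRefetch(String url, long lastModifiedMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setIfModifiedSince(lastModifiedMillis);  // sends If-Modified-Since
        int status = conn.getResponseCode();
        conn.disconnect();
        // 304 means the content is unchanged, so the fetch can be skipped.
        return status != HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    public static void main(String[] args) throws IOException {
        long lastCrawl = System.currentTimeMillis() - 24L * 60 * 60 * 1000;  // e.g. yesterday
        boolean refetch = needsRefetch("http://example.com/", lastCrawl);
        System.out.println(refetch ? "page changed, refetch it" : "unchanged, skip it");
    }
}

A HEAD request comparing Content-Length against the stored file size would complement this, but not every server reports Last-Modified or Content-Length reliably, which is why the content-MD5 check is the more robust fallback.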
