Hi Stefan,

Actually, I implemented NUTCH-61 in my local
development and had a discussion with Andrzej (see
his comments attached below).

Mainly, the first difficulty Andrzej pointed out is
the repeated "deduplication". This might be solved by
calling SegmentMergeTool.java, meaning we keep only
one fresh segment and no longer need to keep all the
old segments. Of course, merging segments has a cost.
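
For illustration, the core of the merge idea looks
roughly like this; SegmentRecord and the other names
are hypothetical stand-ins, not SegmentMergeTool's
actual internals:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {

  // Hypothetical record: one fetched page as stored in a segment.
  static class SegmentRecord {
    String url;
    long fetchTime;  // when this copy was fetched
    byte[] content;  // raw page content
  }

  // Fold all input records into one map keyed by URL, keeping the
  // newest copy; older duplicates simply drop out, so the result can
  // be written back as a single fresh segment.
  static Map<String, SegmentRecord> merge(List<SegmentRecord> all) {
    Map<String, SegmentRecord> newest = new HashMap<String, SegmentRecord>();
    for (SegmentRecord rec : all) {
      SegmentRecord seen = newest.get(rec.url);
      if (seen == null || rec.fetchTime > seen.fetchTime) {
        newest.put(rec.url, rec);
      }
    }
    return newest;
  }
}

Rewriting the merged output as a new segment is where
the cost mentioned above comes from.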

But the second difficulty, the "lost segments", is
exactly as Andrzej described. I see no direct solution
yet; maybe we could rely on the robustness of our
local file system.

My wish is to use NUTCH-61 to save parsing time when a
page's content has not changed.
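
For illustration, the core check NUTCH-61 is after
would be roughly the following; where the stored MD5
comes from and how the old parse gets reused are left
out, and the names are hypothetical:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class UnmodifiedCheck {

  // Hash the freshly fetched bytes and compare with the MD5 recorded
  // at the previous fetch; if they match, the old parse_data/
  // parse_text could be reused instead of parsing the page again.
  static boolean isUnchanged(byte[] fetchedContent, byte[] storedMd5)
      throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] freshMd5 = md.digest(fetchedContent);
    return Arrays.equals(freshMd5, storedMd5);
  }
}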

My testing experience (2 months ago) was that NUTCH-61
DID generate parse_data/ and parse_text/ for a page
with unchanged content (my test might be wrong). I
will run the test again to verify this as soon as I
have a bit of time.

thanks,

Michael Ji

(attached: my previous discussion with Andrzej)
=================================================
Unfortunately, the patches related to detecting
unmodified content will have to wait until after the
release.
Here's the problem: it's quite easy to add this
checking and recording capability to all fetcher
plugins, fetchlist generation and db update tools, and
I've done this in my local patches. However, after a
while I discovered a serious problem in the way Nutch
currently manages "phasing out" old segment data.

If we assume that we always refresh after some fixed
interval (30 days, or whatever), then we can safely
delete segments older than 30 days. If the interval
varies, then potentially we could be stuck with some
segments with very old (but still valid) data. This is
very inefficient, because after a while only a couple
of such pages might be left in a given segment, and
the rest of them would have to be removed again and
again by deduplication, because newer pages would
exist in newer segments.

Moreover (and this is the worst problem), if such
segments are lost, the information in the webdb must
be updated in a way that forces refetching, even
though "If-Modified-Since" or the MD5 indicates that
the page is still unchanged since the last fetch.
Currently the only way to do this is to "add days" -
but if we use a variable refetch interval then it
doesn't make much sense. I think we need to track in a
better way which pages are "missing" from the segments
and have to be re-fetched, or to have a better DB
update mechanism if we lose some segments.
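
To make the phase-out rule concrete, a rough sketch
(hypothetical names): under the fixed-interval
assumption expiry is a trivial age test, and it is
exactly this test that becomes unsafe once intervals
vary, since an old segment may still hold the only
valid copy of some pages.

public class SegmentExpiry {

  // 30 days in milliseconds.
  static final long FIXED_INTERVAL_MS = 30L * 24 * 60 * 60 * 1000;

  // Safe only under the fixed-interval assumption: every page in a
  // segment older than the interval is then guaranteed to have a
  // newer copy in a newer segment, so the whole segment can go.
  static boolean canDelete(long segmentCreationTime, long now) {
    return now - segmentCreationTime > FIXED_INTERVAL_MS;
  }
}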
Perhaps we should extend the Page to record which
segment holds the latest version of the page? But
segments don't have unique IDs now (a directory name
is too fragile and too easily changed).

Related question: in the FetchListEntry we have a
"fetch" flag. I think that after minor modifications
of the FetchListTool (to generate only the entries we
are supposed to fetch) we could get rid of this flag,
or change its semantics to mean "unconditionally
fetch, even if unmodified".
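
For illustration, the Page extension suggested above
could look roughly like this (purely hypothetical: the
current Page has no such field, and stable segment IDs
don't exist yet):

import java.util.Set;

public class PageSketch {
  String url;
  byte[] md5;            // signature of the last fetched content
  long latestSegmentId;  // stable ID of the segment holding the
                         // latest version of this page

  // If the owning segment has been lost, force an unconditional
  // refetch regardless of If-Modified-Since / MD5.
  boolean mustRefetch(Set<Long> liveSegmentIds) {
    return !liveSegmentIds.contains(latestSegmentId);
  }
}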
====================================================

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:

> > to detect the unmodified content of a target page
> > by looking for its content MD5 hash value;
> > somehow, it is not merged to the branch yet; I
> > implemented patch 61 for my local development, but
> > no further testing yet;
> 
> Michael, I really would love to see this patch in
> the sources; however, Andrzej Bialecki suggested
> some improvements. Can you realize these
> improvements against the current sources? I would
> vote for the improved patch, and I guess a lot of
> other people would find this improved patch very
> useful as well.
> 
> THANKS!
> Stefan