Hi Andrzej,
Thanks for your correction. The patch compiled
successfully and is running well in Nutch 0.7.
Just a question out of curiosity.
As stated in NUTCH-61:
"...if content is unmodified it doesn't have to be
fetched and processed..."
But when I tested refetching a page whose content had not
been modified, Nutch with the NUTCH-61 patch DID still parse
the page and write it to content/, parse_data/, and parse_text/.
I took a look at the code:
In Fetcher.java,
"
ProtocolOutput output = protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
:
switch ( pstat ) {
  :
  :
  case ProtocolStatus.NOTMODIFIED:
    handleFetch(fle, output);
    break;
  :
  :
}
"
Should we just do nothing in the NOTMODIFIED case? As I
understand it, that status is set when content.MD5 equals
page.MD5 in protocol.http.java.
handleFetch() actually parses the page and writes the output
data structures to segments/.
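To make the question concrete, here is a minimal standalone sketch of what I have in mind. It does not use the real Nutch classes: Status, fetchStatus(), dispatch(), and the written list are made-up names for illustration only, and the MD5 comparison is just my assumption of how the NOTMODIFIED status gets set.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch; none of these names are actual Nutch classes.
class NotModifiedSketch {

  // Hypothetical stand-in for the protocol status codes.
  enum Status { SUCCESS, NOTMODIFIED }

  // Records which URLs were "parsed and written" (stand-in for segment output).
  static final List<String> written = new ArrayList<>();

  static byte[] md5(String content) {
    try {
      return MessageDigest.getInstance("MD5")
          .digest(content.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the digests every Java platform must provide.
      throw new IllegalStateException(e);
    }
  }

  // Assumed rule: content whose MD5 equals the stored one is NOTMODIFIED.
  static Status fetchStatus(byte[] storedMd5, String fetchedContent) {
    return Arrays.equals(storedMd5, md5(fetchedContent))
        ? Status.NOTMODIFIED : Status.SUCCESS;
  }

  // Stand-in for handleFetch(): pretend to parse and write the page.
  static void handleFetch(String url) {
    written.add(url);
  }

  // The proposed behaviour: only call handleFetch() for new content.
  static void dispatch(String url, Status status) {
    switch (status) {
      case SUCCESS:
        handleFetch(url);
        break;
      case NOTMODIFIED:
        // Do nothing (this assumes the earlier segment data still exists).
        break;
    }
  }

  public static void main(String[] args) {
    byte[] stored = md5("old body");
    dispatch("http://example.com/a", fetchStatus(stored, "old body"));
    dispatch("http://example.com/b", fetchStatus(stored, "new body"));
    System.out.println(written);
  }
}
```

In this sketch the unchanged page never reaches handleFetch(), which is the behaviour I expected from the patch. Whether skipping it is safe in the real Fetcher is exactly what I am asking.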
Thanks,
Michael Ji
--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Michael Ji wrote:
> > "
> > FetchListEntry value = new FetchListEntry();
> > Page page = (Page)value.getPage().clone();
> > "
> >
> > Seems value is an empty FetchListEntry instance. Will
> > that cause clone getPage failure coz it is NULL?
>
> Please try to replace this logic with the following:
>
> FetchListEntry value = new FetchListEntry();
> while (topN > 0 && reader.next(key, value)) {
>   Page page = value.getPage();
>   if (page != null) {
>     Page p = new Page();
>     p.set(page);
>     page = p;
>   }
>   if (forceRefetch) {
>     Page p = value.getPage();
>     // reset fetchTime and MD5, so that the content will
>     // always be new and unique.
>     p.setNextFetchTime(0L);
>     p.setMD5(MD5Hash.digest(p.getURL().toString()));
>   }
>   tables.append(value);
>   topN--;
>
>
> This patchset still needs a lot of thought and work. Even the part
> that avoids re-fetching unmodified content needs additional thinking -
> it's easy to end up in a state where Nutch cannot be forced to re-fetch
> the page because every time you try it remains unmodified - but you
> need to refetch the actual data because e.g. you lost that segment
> data...
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com