Hi Andrzej,

Thanks for your correction. The patch compiles
successfully and runs well in Nutch 0.7.

Just a curious question:

As stated in NUTCH-61:
"...if content is unmodified it doesn't have to be
fetched and processed..."

I tested refetching a page whose content had not been
modified, and Nutch (with the NUTCH-61 patch) DID still
parse this page and write it to content/, parse_data/,
and parse_text/.

I took a look at the code:

In Fetcher.java, 
"
ProtocolOutput output = protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
:
switch ( pstat ) {
:
:
    case ProtocolStatus.NOTMODIFIED:
        handleFetch(fle, output);
        break;
:
:
}
"

Should we just do nothing in the NOTMODIFIED case?
As far as I can tell, that status is set when
content.MD5 equals page.MD5 in protocol.http.java.
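
To be clear about the check I mean, here is a minimal standalone sketch of an "unmodified" test based on comparing MD5 digests. It is not the actual Nutch code: the class and method names (Md5CompareSketch, md5, isUnmodified) are made up for illustration, and it uses only java.security.MessageDigest from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Nutch.
public class Md5CompareSketch {

    // Compute the MD5 digest of the fetched content bytes.
    static byte[] md5(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // Every standard JDK ships MD5, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // The page counts as unmodified when the digest stored from the
    // previous fetch equals the digest of the newly fetched content.
    static boolean isUnmodified(byte[] storedDigest, byte[] fetchedContent) {
        return Arrays.equals(storedDigest, md5(fetchedContent));
    }

    public static void main(String[] args) {
        byte[] stored = md5("hello".getBytes(StandardCharsets.UTF_8));
        // Same content as before: unmodified.
        System.out.println(isUnmodified(stored, "hello".getBytes(StandardCharsets.UTF_8)));
        // Changed content: modified.
        System.out.println(isUnmodified(stored, "hello!".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note that a check like this still has to download the content before it can compare digests, so it only saves the parsing and segment output, not the fetch itself.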

handleFetch() is what actually parses the content and
writes the output data structures to segments/.
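
To make the question concrete, here is a rough standalone sketch of the behaviour I have in mind. It does not use the real Nutch classes: Status, process(), the forceRefetch flag and the written list are all hypothetical stand-ins, and handleFetch() here only records the URL instead of parsing anything. The idea is to skip the output step on NOTMODIFIED unless a refetch is explicitly forced, which I assume would also cover the case where the segment data has been lost.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the Nutch Fetcher.
public class NotModifiedSketch {

    // Stand-in for the protocol status codes.
    enum Status { SUCCESS, NOTMODIFIED, GONE }

    // Stand-in for segments/: records the URLs that were "written".
    static final List<String> written = new ArrayList<>();

    // Stand-in for handleFetch(): pretend to parse and write the page.
    static void handleFetch(String url) {
        written.add(url);
    }

    // Returns true if segment data was written for this URL.
    static boolean process(String url, Status status, boolean forceRefetch) {
        switch (status) {
            case SUCCESS:
                handleFetch(url);
                return true;
            case NOTMODIFIED:
                // Skip parsing and output for unchanged content, unless
                // the caller explicitly asks for the data to be rebuilt.
                if (forceRefetch) {
                    handleFetch(url);
                    return true;
                }
                return false;
            default:
                // Other statuses are out of scope for this sketch.
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(process("http://example.com/a", Status.SUCCESS, false));
        System.out.println(process("http://example.com/b", Status.NOTMODIFIED, false));
        System.out.println(process("http://example.com/c", Status.NOTMODIFIED, true));
        System.out.println(written);
    }
}
```

With this shape, an unmodified page produces no new entries in segments/ on a normal run, but the data can still be regenerated when the force flag is set.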

Thanks,

Michael Ji

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Michael Ji wrote:
> > "
> > FetchListEntry value = new FetchListEntry();
> > Page page = (Page)value.getPage().clone();
> > "
> > 
> > It seems value is an empty FetchListEntry instance. Will
> > that cause the clone of getPage() to fail because it is NULL?
> 
> Please try to replace this logic with the following:
> 
>                  FetchListEntry value = new FetchListEntry();
>                  while (topN > 0 && reader.next(key, value)) {
>                    Page page = value.getPage();
>                    if (page != null) {
>                      Page p = new Page();
>                      p.set(page);
>                      page = p;
>                    }
>                    if (forceRefetch) {
>                      Page p = value.getPage();
>                      // reset fetchTime and MD5, so that the content will
>                      // always be new and unique.
>                      p.setNextFetchTime(0L);
>                      p.setMD5(MD5Hash.digest(p.getURL().toString()));
>                    }
>                    tables.append(value);
>                    topN--;
> 
> 
> This patchset still needs a lot of thought and work. Even the part that
> avoids re-fetching unmodified content needs additional thinking - it's
> easy to end up in a state where Nutch cannot be forced to re-fetch the
> page because every time you try it remains unmodified - but you need
> to refetch the actual data because e.g. you lost that segment data...
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 

