Hi Andrzej,

Thanks for your reply.

Yes, I saw the unmodified page content stored in
parse_data/ and parse_text/ within the newly fetched
segments/. I even printed the MD5 signature of the
newly fetched content in http.java
(56eae3c2556cb10a00e7346738dcb318), and it matches the
one associated with the same URL in the fetchlist.

I have several concerns:

1) Where is the "If-Modified-Since" header set? I didn't
see it anywhere in the core fetcher or db code.

2) I saw that the logic reaches "code == 200" in
http.java, which is how I can see the content MD5. Does
that mean the protocol actually sent the content back,
so the If-Modified-Since header was not noticed?

If the content had been skipped, as you said, the
response should have been a 304.

3) When I applied the Nutch 61 patch to Nutch 07, the
old code did not exactly match what your patch expects.

For example, in the Nutch 07 version of
"
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
"

I didn't see
"
get.setHttp11(false);
get.setMethodRetryHandler(null);
"

But these two lines are in your Nutch 61 diff.

Could that cause the problem?
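
To make questions 1) and 2) concrete, here is a minimal sketch of the behavior I would expect. It is not Nutch code: the class and method names (ConditionalGetSketch, ifModifiedSinceValue, md5Hex, classify) are hypothetical and only illustrate the idea. It builds the value for the "If-Modified-Since" request header and separates a real 304 from a 200 whose body merely has the same MD5 as before.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, NOT part of Nutch: illustrates the request header
// and the status handling that questions 1) and 2) ask about.
public class ConditionalGetSketch {

    // HTTP dates use a fixed English layout expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    // Value to send in the "If-Modified-Since" request header.
    static String ifModifiedSinceValue(long lastFetchMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(lastFetchMillis));
    }

    // Hex MD5 of the fetched bytes, comparable to a stored signature.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // 304: the server honored the header and sent no body.
    // 200 with an equal MD5: the body was sent again but is unchanged.
    static String classify(int code, String oldMd5, byte[] body) {
        if (code == 304) {
            return "NOTMODIFIED (304, no body sent)";
        }
        if (code == 200 && md5Hex(body).equals(oldMd5)) {
            return "NOTMODIFIED (MD5 match, body re-sent)";
        }
        if (code == 200) {
            return "SUCCESS (content changed)";
        }
        return "OTHER";
    }

    public static void main(String[] args) {
        byte[] page = "same page".getBytes();
        String stored = md5Hex(page);
        System.out.println("If-Modified-Since: " + ifModifiedSinceValue(0L));
        System.out.println(classify(304, stored, new byte[0]));
        System.out.println(classify(200, stored, page));
    }
}
```

What I observe looks like the second case (200 with a matching MD5), which is why I wonder whether the header reaches the server at all.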


Thanks,

Michael Ji

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Michael Ji wrote:
> > hi Andrzej:
> > 
> > Thanks for your correction. The patch is compiled
> > successfully and running well in Nutch 07.
> > 
> > Just a curious question:
> > 
> > As stated in nutch 61:
> > "...if content is unmodified it doesn't have to be
> > fetched and processed..."
> > 
> > And I did test for refetching a page without content
> > modification and Nutch 6.1 DID parsing this page to
> > content/, parse_data/, and parse_text/
> > 
> 
> Are you sure the plugin retrieved the page content once again
> from the server? Because I use "If-Modified-Since", which means
> that if the content is unmodified the server should NOT send the
> page once again, just a status 304.
> 
> > I took look at code: 
> > 
> > In Fetcher.java, 
> > "
> > ProtocolOutput output =
> > protocol.getProtocolOutput(fle);
> > ProtocolStatus pstat = output.getStatus();
> > :
> > switch ( pstat ) {
> > :
> > :
> >     case ProtocolStatus.NOTMODIFIED:
> >          handleFetch(fle, output); 
> >     break;
> > :
> > :
> > }
> > "
> > 
> > Should we just do nothing in case of NOTMODIFIED,
> > which is the flag set when content.MD5 = page.MD5 in
> > protocol.http.java?
> > 
> 
> We can't do nothing - we need to report the status. Even when we
> report an error, an additional record is written to segments...
> 
> > The handleFetch() actually parsing and output data
> > structure to segments/.
> 
> Yes, that's correct - this was a conscious decision. The reason is
> that the server may return other interesting information in headers,
> which some of the parsing plugins or FetchSchedule implementations
> may need.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com



                