Hi Andrzej,
Thanks for your reply.
Yes, I saw the unmodified page content stored in
parse_data/ and parse_text/ within the newly fetched
segments/. I even printed the MD5 signature of the
newly fetched content in http.java
(56eae3c2556cb10a00e7346738dcb318), and it matches the
one associated with the same URL in the fetchlist.
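
For reference, the check I did amounts to something like
the sketch below. It is not the code in http.java: the
names Md5Sketch, md5Hex and sameContent are made up for
illustration, and it only relies on the JDK's
java.security.MessageDigest.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch.
class Md5Sketch {

    /** Returns the MD5 digest of the bytes as 32 lowercase hex characters. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder(32);
            for (byte b : md.digest(content)) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** True when the old and the newly fetched content hash to the same MD5. */
    static boolean sameContent(byte[] oldContent, byte[] newContent) {
        return md5Hex(oldContent).equals(md5Hex(newContent));
    }
}
```

If sameContent() is true for the stored and the refetched
bytes, the page body itself did not change between the two
fetches.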
I have several concerns:
1) Where is the "If-Modified-Since" header set? I didn't
see it in any of the core fetcher or db code.
2) I saw the logic reach "code == 200" in http.java,
which is how I could print the content MD5. Does that
mean the content was actually sent back, so the
If-Modified-Since header was not noticed?
If the response body had been skipped, as you said, the
code should have been 304.
3) While applying the nutch 61 patch to nutch 07, I could
not find old code that exactly matches your nutch 61 patch.
For example, in the nutch 07 copy of
"
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
"
I didn't see
"
get.setHttp11(false);
get.setMethodRetryHandler(null);
"
But these two lines are in your nutch 61 diff.
Will that cause a problem?
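
To make questions 1) and 2) concrete, here is a minimal
sketch of what I would expect a conditional GET to look
like with the plain JDK. It is not taken from Nutch or
from the nutch 61 patch: ConditionalGetSketch, Outcome,
classify and conditionalGet are hypothetical names, and I
use java.net.HttpURLConnection here rather than the
httpclient plugin.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical illustration, not Nutch code.
class ConditionalGetSketch {

    /** Made-up outcome names; these are not Nutch's ProtocolStatus codes. */
    enum Outcome { NOT_MODIFIED, FETCHED, FAILED }

    /** Maps an HTTP response code to an outcome. */
    static Outcome classify(int code) {
        if (code == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return Outcome.NOT_MODIFIED;   // 304: server sent no body
        }
        if (code == HttpURLConnection.HTTP_OK) {
            return Outcome.FETCHED;        // 200: server sent the full body
        }
        return Outcome.FAILED;             // anything else, for this sketch
    }

    /** Issues a GET carrying If-Modified-Since set to the last fetch time. */
    static Outcome conditionalGet(String url, long lastFetchMillis)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        // Adds the If-Modified-Since request header (milliseconds since epoch).
        conn.setIfModifiedSince(lastFetchMillis);
        try {
            return classify(conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }
}
```

With something like this, a server that honours the header
should answer 304 for an unchanged page. A server that
ignores the header answers 200 with the full body, which
might explain what I am seeing.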
Thanks,
Michael Ji
--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Michael Ji wrote:
> > hi Andrzej:
> >
> > Thanks for your correction. The patch is compiled
> > successfully and running well in Nutch 07.
> >
> > Just a curious question:
> >
> > As stated in nutch 61:
> > "...if content is unmodified it doesn't have to be
> > fetched and processed..."
> >
> > And I did test for refetching a page without content
> > modification and Nutch 6.1 DID parsing this page to
> > content/, parse_data/, and parse_text/
> >
>
> Are you sure the plugin retrieved the page content
> once again from the
> server? Because I use "If-Modified-Since", which
> means that if the
> content is unmodified the server should NOT send the
> page once again,
> just a status 304.
>
> > I took look at code:
> >
> > In Fetcher.java,
> > "
> > ProtocolOutput output =
> > protocol.getProtocolOutput(fle);
> > ProtocolStatus pstat = output.getStatus();
> > :
> > switch ( pstat ) {
> > :
> > :
> > case ProtocolStatus.NOTMODIFIED:
> > handleFetch(fle, output);
> > break;
> > break;
> > :
> > :
> > }
> > "
> >
> > Should we just do nothing in case of NOTMODIFIED,
> > which is the flag set when content.MD5 = page.MD5 in
> > protocol.http.java?
> >
>
> We can't do nothing - we need to report the status.
> Even when we report
> an error, an additional record is written to
> segments...
>
> > The handleFetch() actually parsing and output data
> > structure to segments/.
>
> Yes, that's correct - this was a conscious decision.
> The reason is that
> the server may return other interesting information
> in headers, which
> some of the parsing plugins or FetchSchedule
> implementations may need.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>