Michael Ji wrote:
hi Andrzej:

Thanks for your correction. The patch compiled
successfully and is running well in Nutch 0.7.

Just a question out of curiosity:

As stated in NUTCH-61:
"...if content is unmodified it doesn't have to be
fetched and processed..."

I tested refetching a page whose content had not been
modified, and Nutch (with the NUTCH-61 patch) DID parse
this page and write it to content/, parse_data/, and
parse_text/.


Are you sure the plugin retrieved the page content once again from the server? Because I use "If-Modified-Since", which means that if the content is unmodified the server should NOT send the page once again, just a status 304.
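
To illustrate the conditional GET described above, here is a minimal sketch. It is not the plugin's actual code: the class and method names (ConditionalGetSketch, statusFor, conditionalGet) are made up for this example, and the server-side rule is simplified to a plain timestamp comparison. Only setIfModifiedSince() and the HTTP_* constants come from the standard java.net API.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class ConditionalGetSketch {

    /**
     * Simplified server-side rule (an assumption for illustration):
     * answer 304 when the resource has not changed since the client's copy.
     */
    public static int statusFor(long ifModifiedSinceMillis, long lastModifiedMillis) {
        // HTTP dates have one-second resolution, so compare whole seconds.
        long clientSecs = ifModifiedSinceMillis / 1000;
        long serverSecs = lastModifiedMillis / 1000;
        return serverSecs <= clientSecs
                ? HttpURLConnection.HTTP_NOT_MODIFIED  // 304: no body is sent
                : HttpURLConnection.HTTP_OK;           // 200: full body is sent
    }

    /** Client side: issue a conditional GET and return the HTTP status code. */
    public static int conditionalGet(String url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        // Adds the If-Modified-Since request header.
        conn.setIfModifiedSince(lastFetchMillis);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

If conditionalGet() returns HttpURLConnection.HTTP_NOT_MODIFIED, the server sent no page body, so nothing was downloaded again.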

I took a look at the code. In Fetcher.java:

  ProtocolOutput output = protocol.getProtocolOutput(fle);
  ProtocolStatus pstat = output.getStatus();
  :
  switch ( pstat ) {
  :
  :
  case ProtocolStatus.NOTMODIFIED: handleFetch(fle, output); break;
  :
  :
  }

Should we just do nothing in case of NOTMODIFIED,
which is the flag set when content.MD5 = page.MD5 in
protocol.http.java?


We can't do nothing - we need to report the status. Even when we report an error, an additional record is written to segments...

handleFetch() actually parses the page and writes the
output data structures to segments/.

Yes, that's correct - this was a conscious decision. The reason is that the server may return other interesting information in headers, which some of the parsing plugins or FetchSchedule implementations may need.
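
A rough sketch of that idea follows. None of this is Nutch code: NotModifiedSketch, FetchRecord, Status and handle() are hypothetical names used only to show a record being written for every status, with headers kept even when there is no body.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class NotModifiedSketch {

    /** Hypothetical status codes; not the real ProtocolStatus constants. */
    public enum Status { SUCCESS, NOTMODIFIED, FAILED }

    /** Hypothetical stand-in for one record written to a segment. */
    public static final class FetchRecord {
        public final String url;
        public final Status status;
        public final Map<String, String> headers;
        public final String content; // null when no body was received

        FetchRecord(String url, Status status,
                    Map<String, String> headers, String content) {
            this.url = url;
            this.status = status;
            this.headers = headers;
            this.content = content;
        }
    }

    /** In-memory stand-in for the segment output. */
    public final List<FetchRecord> segment = new ArrayList<>();

    public void handle(String url, Status status,
                       Map<String, String> headers, String content) {
        switch (status) {
            case SUCCESS:
                segment.add(new FetchRecord(url, status, headers, content));
                break;
            case NOTMODIFIED:
                // No body arrived, but the status and headers are still
                // recorded so that later stages can inspect them.
                segment.add(new FetchRecord(url, status, headers, null));
                break;
            default:
                // Even a failure produces a record carrying its status.
                segment.add(new FetchRecord(url, status,
                        Collections.<String, String>emptyMap(), null));
        }
    }
}
```

The point of the design is that every outcome leaves a trace in the output, so downstream code never has to guess why a URL is missing.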

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
