Michael Ji wrote:
hi Andrzej:
Thanks for your correction. The patch compiled
successfully and is running well in Nutch 0.7.
Just a question out of curiosity:
As stated in NUTCH-61:
"...if content is unmodified it doesn't have to be
fetched and processed..."
And I tested refetching a page whose content had not
been modified, and the NUTCH-61 patch DID parse this page
into content/, parse_data/, and parse_text/
Are you sure the plugin retrieved the page content once again from the
server? Because I use "If-Modified-Since", which means that if the
content is unmodified the server should NOT send the page once again,
just a status 304.
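
For illustration only, here is a minimal sketch of a conditional GET using the
plain JDK HttpURLConnection rather than the Nutch protocol plugin. The class
name ConditionalGet and both method names are hypothetical and do not come
from the patch. It assumes an http:// or https:// URL and a last-fetch time in
milliseconds since the epoch.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

// Hypothetical helper, not part of Nutch: shows how If-Modified-Since works.
class ConditionalGet {

    /** True when the server reported that the page is unchanged (HTTP 304). */
    public static boolean isNotModified(int httpStatus) {
        return httpStatus == HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    /**
     * Sends a GET carrying an If-Modified-Since header built from
     * lastFetchMillis and returns the HTTP status code of the response.
     * Assumes the URL uses the http or https scheme.
     */
    public static int conditionalGet(URL url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // The JDK turns this timestamp into the If-Modified-Since header.
        conn.setIfModifiedSince(lastFetchMillis);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java ConditionalGet <http-url> <last-fetch-epoch-millis>");
            return;
        }
        URL url = URI.create(args[0]).toURL();
        int status = conditionalGet(url, Long.parseLong(args[1]));
        System.out.println(status + (isNotModified(status) ? " (not modified)" : ""));
    }
}
```

A server that honours the header answers 304 with no body when the page has
not changed since the given time; a server that ignores the header simply
answers 200 with the full page.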
I took a look at the code:
In Fetcher.java,
"
ProtocolOutput output = protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
:
switch ( pstat ) {
  :
  :
  case ProtocolStatus.NOTMODIFIED:
    handleFetch(fle, output);
    break;
  :
  :
}
"
Should we just do nothing in case of NOTMODIFIED,
which is the flag set when content.MD5 = page.MD5 in
protocol.http.java?
We can't do nothing - we need to report the status. Even when we report
an error, an additional record is written to segments...
handleFetch() actually parses the page and writes the
output data structures to segments/.
Yes, that's correct - this was a conscious decision. The reason is that
the server may return other interesting information in headers, which
some of the parsing plugins or FetchSchedule implementations may need.
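
As a rough sketch of that idea, and not the actual Fetcher code: every fetch
attempt produces one record that carries the status and the response headers,
while the body is stored only for a successful fetch. All names here
(NotModifiedSketch, Status, FetchRecord, handleFetch as written below) are
hypothetical stand-ins for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not the Nutch classes: one record per fetch attempt.
class NotModifiedSketch {

    public enum Status { SUCCESS, NOTMODIFIED, FAILED }

    /** What gets written for a single fetch attempt, whatever the outcome. */
    public static final class FetchRecord {
        public final String url;
        public final Status status;
        public final Map<String, String> headers;
        public final byte[] content; // empty unless status is SUCCESS

        FetchRecord(String url, Status status, Map<String, String> headers, byte[] content) {
            this.url = url;
            this.status = status;
            this.headers = headers;
            this.content = content;
        }
    }

    // Stand-in for a segment: an in-memory list of records.
    private final List<FetchRecord> segment = new ArrayList<>();

    /** Records the attempt; headers are kept even when there is no body. */
    public void handleFetch(String url, Status status,
                            Map<String, String> headers, byte[] content) {
        byte[] body = (status == Status.SUCCESS && content != null) ? content : new byte[0];
        segment.add(new FetchRecord(url, status, new LinkedHashMap<>(headers), body));
    }

    public List<FetchRecord> records() {
        return segment;
    }

    public static void main(String[] args) {
        NotModifiedSketch sketch = new NotModifiedSketch();
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Expires", "Thu, 01 Dec 2005 16:00:00 GMT");
        sketch.handleFetch("http://example.com/", Status.NOTMODIFIED, headers, null);
        FetchRecord r = sketch.records().get(0);
        System.out.println(r.url + " " + r.status + " " + r.headers);
    }
}
```

With a record like this, a scheduling component could still read a header
such as Expires from a NOTMODIFIED response when deciding the next fetch time.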
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com