I'm trying to configure Nutch to index pages/files that are truncated (in
addition to the successful non-truncated files).

I'm using the okhttp protocol plugin (protocol-okhttp), because I don't
think the http protocol plugin (protocol-http) records truncation
information.

I'm using parse-tika, and "parser.skip.truncated" is left at its default
value of true.

The particular PDF that I'm experimenting with is returned chunked with gzip
compression. There is no Content-Length header in the response.

For this PDF, okhttp correctly marks it as truncated, but the file is still
sent to parse-tika, which throws a parse exception. As a result, the file is
never sent to the index.

If I understand correctly, ParseSegment is checking for truncation, but it
requires a Content-Length header to work. In my case, there is no
Content-Length header, so it assumes the file is not truncated.

Should I open a ticket to have ParseSegment also check for okhttp's
truncation flag in the content metadata (http.content.truncated=true)?
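
To make the question concrete, here is a minimal standalone sketch of the
behavior as I understand it. It is not the Nutch source: the class and method
names are hypothetical, and a plain Map stands in for Nutch's content
metadata. The only name taken from my setup is the http.content.truncated key.
The first method is my reading of the current length-based check; the second
is roughly what I would propose.

```java
import java.util.Map;

/** Standalone sketch, not the Nutch source. All names here are hypothetical. */
class TruncationSketch {

    // Metadata key set by okhttp, as quoted in this message.
    static final String TRUNCATED_FLAG = "http.content.truncated";

    /** Length-based check: it can only fire when a Content-Length value exists. */
    static boolean truncatedByLength(Map<String, String> metadata, int bytesFetched) {
        String declared = metadata.get("Content-Length");
        if (declared == null || declared.trim().isEmpty()) {
            return false; // chunked response: nothing to compare against
        }
        try {
            return Integer.parseInt(declared.trim()) > bytesFetched;
        } catch (NumberFormatException e) {
            return false; // unparseable value is treated as "not truncated"
        }
    }

    /** Proposed check: also trust the protocol plugin's explicit flag. */
    static boolean truncatedByLengthOrFlag(Map<String, String> metadata, int bytesFetched) {
        return Boolean.parseBoolean(metadata.get(TRUNCATED_FLAG))
                || truncatedByLength(metadata, bytesFetched);
    }

    public static void main(String[] args) {
        // My case: chunked + gzip, no Content-Length, but the flag is set.
        Map<String, String> chunked = Map.of(TRUNCATED_FLAG, "true");
        System.out.println(truncatedByLength(chunked, 65536));       // false
        System.out.println(truncatedByLengthOrFlag(chunked, 65536)); // true
    }
}
```

With only the length-based check, my PDF looks complete and goes on to the
parser; with the flag consulted as well, it would be recognized as truncated
before parsing.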

Is there a way to index files even if they are truncated or if there is a
parse exception?

If indexing is a bridge too far, what's the most efficient way to dump a
list of URLs that were truncated and/or had a parse exception?

Thank you!

Best,

      Tim
