I'm trying to configure Nutch to index pages/files that are truncated (in addition to the successful non-truncated files).

I'm using the okhttp protocol plugin because I don't think the http protocol plugin stores truncation information. I'm parsing with parse-tika, and "parser.skip.truncated" is left at its default value of true.

The particular PDF I'm experimenting with is returned chunked with gzip compression, and the response has no Content-Length header. okhttp correctly marks this PDF as truncated, but the file is still sent to parse-tika, which throws a parse exception. The file is then not sent to the index.

If I understand correctly, ParseSegment does check for truncation, but its check relies on the Content-Length header. Since there is no Content-Length header in my case, it assumes the file is not truncated.

Should I open a ticket to have ParseSegment also check for okhttp's header (http.content.truncated=true)? Is there a way to index files even if they are truncated or if there is a parse exception? If indexing is a bridge too far, what's the most efficient way to dump a list of URLs that are truncated and/or had a parse exception?

Thank you!

Best,
Tim
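
P.S. To make the ParseSegment suggestion concrete, here is a minimal sketch of the kind of check I have in mind. It is not Nutch's actual code: the class name TruncationCheck, the method isTruncated, and the use of a plain Map in place of Nutch's metadata object are all hypothetical, and I'm assuming the flag is stored under the key http.content.truncated as described above. The idea is to trust the explicit flag first and fall back to the Content-Length comparison only when the flag is absent.

```java
import java.util.Map;

// Hypothetical illustration only; this does not use Nutch's real classes.
class TruncationCheck {
    // Key name taken from the message above; assumed to be how okhttp flags truncation.
    static final String TRUNCATED_FLAG = "http.content.truncated";
    static final String CONTENT_LENGTH = "Content-Length";

    // Returns true if the content should be treated as truncated.
    static boolean isTruncated(Map<String, String> metadata, int actualLength) {
        // 1. An explicit truncation flag wins, even without a Content-Length header.
        if ("true".equalsIgnoreCase(metadata.get(TRUNCATED_FLAG))) {
            return true;
        }
        // 2. Otherwise fall back to comparing the declared and actual lengths.
        String declared = metadata.get(CONTENT_LENGTH);
        if (declared == null) {
            return false; // nothing to compare against
        }
        try {
            return actualLength < Integer.parseInt(declared.trim());
        } catch (NumberFormatException e) {
            return false; // unparsable header: treat as not truncated
        }
    }

    public static void main(String[] args) {
        // Chunked response with the flag set and no Content-Length (my case).
        System.out.println(isTruncated(Map.of(TRUNCATED_FLAG, "true"), 1024)); // true
        // No flag, but fewer bytes than the header declared.
        System.out.println(isTruncated(Map.of(CONTENT_LENGTH, "2048"), 1024)); // true
        // No flag and no Content-Length: cannot be detected.
        System.out.println(isTruncated(Map.of(), 1024));                       // false
    }
}
```

With a check along these lines, the chunked PDF in my example would be recognized as truncated before it reaches the parser, instead of falling through to a parse exception.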