date:20231018

truncation, parsing and indexing?

2023-10-18 Thread Tim Allison

I'm trying to configure Nutch to index truncated pages/files (in addition to the successfully fetched, non-truncated ones). I'm using the okhttp protocol because I don't think the http protocol stores truncation information. I'm using parse-tika, and "parser.skip.truncated" is set to

Re: truncation, parsing and indexing?

2023-10-18 Thread Tim Allison

One workaround to ignore parse exceptions (at least in the Tika parser): https://github.com/tballison/nutch/tree/ignore-parse-exception

Proposed fix for truncation checking: https://github.com/tballison/nutch/tree/okhttp-truncated

On 2023/10/18 14:28:45 Tim Allison wrote:
> I'm trying to