I'm trying to configure Nutch to index pages/files that are truncated (in
addition to the successful non-truncated files).
I'm using the okhttp protocol, because I don't think the http protocol
stores truncation information.
I'm using parse-tika, and the "parser.skip.truncated" is set to
One workaround to ignore parse exceptions (at least in the Tika parser):
https://github.com/tballison/nutch/tree/ignore-parse-exception
Proposed fix for truncation checking:
https://github.com/tballison/nutch/tree/okhttp-truncated
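
For anyone following along, the setup described at the top of this thread (okhttp protocol, parse-tika, "parser.skip.truncated") would correspond to a nutch-site.xml fragment roughly like the sketch below. Two things here are assumptions, not taken from the messages: the value shown for parser.skip.truncated is inferred from the stated goal of indexing truncated pages (the original sentence is cut off before the value), and the plugin.includes regex is the stock default list with protocol-http swapped for protocol-okhttp, which may not match the actual crawl configuration.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: stock plugin list with protocol-okhttp in place of protocol-http -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-okhttp|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <!-- Assumption: false, so that truncated documents are still passed to the parser -->
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
  </property>
</configuration>
```

With a setup along these lines, whether a truncated document actually reaches the index still depends on the parser tolerating the incomplete content, which is what the two branches above are meant to address.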
On 2023/10/18 14:28:45 Tim Allison wrote:
> I'm trying to