Markus Jelsma created NUTCH-2865:
------------------------------------
Summary: WARC exporter support for metadata and dropping empty
responses
Key: NUTCH-2865
URL: https://issues.apache.org/jira/browse/NUTCH-2865
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.19
WARCExporter is a handy tool for dumping segments. Unfortunately, it also emits
WARC records for statuses other than success or notmodified, and those account
for a decent number of records in each crawl cycle. Until now it also did not
emit parse metadata or extracted text. With this patch it can.
This patch adds three switches:
* -omitEmptyResponses to only emit records of success or notmodified
* -includeParseData to also emit parse metadata as a WARC metadata record
* -includeParseText to also emit extracted text as a WARC metadata record
Both metadata objects are stored in the same WARC metadata record to save space.
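
To illustrate how the three switches behave, here is a minimal Java sketch. It is not the code from the patch: the class WarcExportSketch, the FetchStatus enum, the methods shouldEmit and buildMetadataBody, and the "parseText" field name are all hypothetical, and the status list is a simplified stand-in for Nutch's real fetch statuses. It only shows the assumed logic: filter on status when -omitEmptyResponses is set, and write parse metadata and extracted text into one metadata body.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are taken from the Nutch code base.
class WarcExportSketch {

    // Simplified stand-in for Nutch's fetch statuses (assumption).
    enum FetchStatus { SUCCESS, NOTMODIFIED, GONE, RETRY, REDIRECT }

    // With omitEmptyResponses set, only success and notmodified are emitted.
    // Without it, every status is emitted, as before the patch.
    static boolean shouldEmit(FetchStatus status, boolean omitEmptyResponses) {
        if (!omitEmptyResponses) {
            return true;
        }
        return status == FetchStatus.SUCCESS || status == FetchStatus.NOTMODIFIED;
    }

    // Builds the body of a single metadata record holding both objects.
    // Returns null when neither switch is set or there is nothing to write.
    static String buildMetadataBody(Map<String, String> parseData, String parseText,
                                    boolean includeParseData, boolean includeParseText) {
        StringBuilder body = new StringBuilder();
        if (includeParseData && parseData != null) {
            for (Map.Entry<String, String> entry : parseData.entrySet()) {
                body.append(entry.getKey()).append(": ")
                    .append(entry.getValue()).append("\r\n");
            }
        }
        if (includeParseText && parseText != null) {
            // "parseText" is an invented field name for this sketch.
            body.append("parseText: ").append(parseText).append("\r\n");
        }
        return body.length() == 0 ? null : body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> parseData = new LinkedHashMap<>();
        parseData.put("title", "Example page");

        System.out.println(shouldEmit(FetchStatus.GONE, true));
        System.out.print(buildMetadataBody(parseData, "Hello world", true, true));
    }
}
```

A real exporter would also have to escape or fold multi-line text before writing it as a field value; the sketch leaves that out.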
--
This message was sent by Atlassian Jira
(v8.3.4#803005)