[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842291#comment-17842291 ]
Sebastian Nagel commented on NUTCH-3028:
----------------------------------------

+1 lgtm.

One question: if there is no parseData, the JEXL expression is not evaluated. Since WARC files may include only the raw HTML plus fetch/capture metadata, successfully parsing a document is not a requirement to archive it in a WARC file. Might be useful to have the JEXL filtering also available for unparsed docs.

> WARCExported to support filtering by JEXL
> -----------------------------------------
>
>                 Key: NUTCH-3028
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3028
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.19
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.21
>
>         Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the
> next example, all records with SOME_KEY=SOME_VALUE in their parseData
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
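
To illustrate the question raised in the comment, here is a minimal sketch of a filter that is still evaluated when a record has no parse metadata. It is not the WARCExporter code from the patch: the class name, the method names, and the use of a plain `java.util.function.Predicate` in place of a compiled JEXL expression are all assumptions made for illustration only.

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch, NOT Nutch code: a Predicate stands in for the
// compiled JEXL expression, and a plain Map stands in for the parse metadata.
public class WarcFilterSketch {

    // Mirrors the example expression from the issue, written null-safely:
    // true when the metadata maps SOME_KEY to SOME_VALUE.
    static final Predicate<Map<String, String>> SOME_KEY_FILTER =
        meta -> "SOME_VALUE".equals(meta.get("SOME_KEY"));

    // Evaluates the filter for every record. An unparsed record
    // (parseMeta == null) is treated as having empty metadata, so the
    // expression is still evaluated instead of being bypassed.
    static boolean matches(Map<String, String> parseMeta,
                           Predicate<Map<String, String>> filter) {
        Map<String, String> meta = (parseMeta == null) ? Map.of() : parseMeta;
        return filter.test(meta);
    }

    public static void main(String[] args) {
        Map<String, String> parsed = Map.of("SOME_KEY", "SOME_VALUE");
        System.out.println(matches(parsed, SOME_KEY_FILTER)); // parsed record
        System.out.println(matches(null, SOME_KEY_FILTER));   // unparsed record
    }
}
```

With this approach, a filter that refers only to fetch or capture metadata could also select unparsed documents, while a filter on parse metadata simply would not match them.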