[ https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382987#comment-15382987 ]
Markus Jelsma edited comment on NUTCH-1414 at 7/18/16 8:23 PM:
---------------------------------------------------------------
The regex parse filter from NUTCH-2227 can extract content from HTML, but you
still need to convert any dates it finds to the date format Solr expects. You
could take the code of NUTCH-2227 as an example of how to find dates and then
convert them to the proper format.
edit: actually, it is a bad example, because it only sets a flag to true. You
could try modifying this patch to look only for the HTML tags you described;
that should work better.
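To illustrate the conversion step, here is a minimal sketch in plain Java. It
is not code from NUTCH-2227 or from the attached patch and uses no Nutch API.
The class name SolrDateSketch, the <span class="date"> markup and the
"18 July 2016" input format are all assumptions made for the example. The only
fixed part is the target: Solr date fields expect ISO 8601 in UTC, such as
2016-07-18T00:00:00Z.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class SolrDateSketch {

  // Assumed markup: <span class="date">18 July 2016</span>
  private static final Pattern DATE_TAG = Pattern.compile(
      "<span\\s+class=\"date\">\\s*([^<]+?)\\s*</span>",
      Pattern.CASE_INSENSITIVE);

  // Assumed input format, e.g. "18 July 2016".
  private static final DateTimeFormatter INPUT =
      DateTimeFormatter.ofPattern("d MMMM uuuu", Locale.ENGLISH);

  // Solr date format: ISO 8601 with a literal trailing Z (UTC).
  private static final DateTimeFormatter SOLR =
      DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);

  // Returns the first date found in the assumed tag, formatted for Solr,
  // or an empty Optional if no tag matches or the text is not a valid date.
  public static Optional<String> extractSolrDate(String html) {
    Matcher m = DATE_TAG.matcher(html);
    if (!m.find()) {
      return Optional.empty();
    }
    try {
      LocalDate date = LocalDate.parse(m.group(1), INPUT);
      // No time of day is available, so midnight UTC is used.
      return Optional.of(date.atStartOfDay(ZoneOffset.UTC).format(SOLR));
    } catch (DateTimeParseException e) {
      return Optional.empty();
    }
  }
}
```

A real parse filter would store the resulting string in the parse metadata
instead of returning it, and would probably need to try several input formats.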
> Date extraction parse filter
> ----------------------------
>
> Key: NUTCH-1414
> URL: https://issues.apache.org/jira/browse/NUTCH-1414
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch to provide means to extract an
> arbitrary page date (article date) from the parse text.