[ 
https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382987#comment-15382987
 ] 

Markus Jelsma edited comment on NUTCH-1414 at 7/18/16 8:23 PM:
---------------------------------------------------------------

The regex parse filter from NUTCH-2227 can extract text from HTML, but you 
still need to convert whatever it matches into a proper Solr date format. You 
could use the NUTCH-2227 code as an example of how to find dates and then 
convert them into the proper format.

edit: actually, it is a bad example, as it only sets a flag to true. You could 
instead try modifying this patch to look only for the HTML tags you described; 
that should work better.
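
To illustrate the two steps mentioned above (find a date in the HTML, then convert it to the format Solr expects), here is a minimal sketch in plain Java. It does not use any Nutch API and is not taken from either patch. The class name DateExtractSketch, the method toSolrDate, and the <meta name="date" content="yyyy-MM-dd"> tag are all assumptions made for the example; the tags in your pages will probably differ, so the pattern would need adjusting. Solr date fields expect ISO-8601 in UTC, for example 2016-07-18T00:00:00Z.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: not part of Nutch or of the NUTCH-1414 patch. */
class DateExtractSketch {

  // ASSUMPTION: the page carries its date in a tag of this exact shape,
  // e.g. <meta name="date" content="2016-07-18">. Adjust to your markup.
  private static final Pattern META_DATE = Pattern.compile(
      "<meta\\s+name=\"date\"\\s+content=\"(\\d{4}-\\d{2}-\\d{2})\"",
      Pattern.CASE_INSENSITIVE);

  /**
   * Finds the first matching date in the HTML and returns it as an
   * ISO-8601 UTC timestamp at midnight, or empty if nothing usable is found.
   */
  static Optional<String> toSolrDate(String html) {
    Matcher m = META_DATE.matcher(html);
    if (!m.find()) {
      return Optional.empty();
    }
    try {
      // LocalDate.parse uses the ISO yyyy-MM-dd format by default.
      LocalDate date = LocalDate.parse(m.group(1));
      return Optional.of(DateTimeFormatter.ISO_INSTANT.format(
          date.atStartOfDay(ZoneOffset.UTC).toInstant()));
    } catch (DateTimeParseException e) {
      // Matched the digit pattern but is not a real date, e.g. 2016-13-40.
      return Optional.empty();
    }
  }
}
```

In a parse filter, a value produced this way would then be stored in the parse metadata under whatever field name your Solr schema uses for the date; that wiring is left out here because it depends on the plugin code.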


was (Author: markus17):
The regex parse filter NUTCH-2227 can grab stuff from HTML. But you still need 
to translate them to proper Solr date formats. You could take the code of 
NUTCH-2227 as an example to find dates and then change them into the proper 
format.

edit: well actually, it is an bad example as it only sets some flag to true. 
You could try modifying this patch to just look for the HTML tags you 
described, that should work better.

> Date extraction parse filter
> ----------------------------
>
>                 Key: NUTCH-1414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1414
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch to provide means to extract an 
> arbitrary page date (article date) from the parse text.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
