[ 
https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382957#comment-15382957
 ] 

Markus Jelsma edited comment on NUTCH-1414 at 7/18/16 8:06 PM:
---------------------------------------------------------------

It operates on the parsed text, or on the extracted text if you use an 
extractor, so on most pages it will either miss the date or find the wrong one. 
This plugin only translates 'free text' dates to Date objects and picks the 
first one it finds. It does not locate the correct date of the article, which 
is much more difficult and needs much more context than plain extracted text.
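
To illustrate the "picks the first it finds" behaviour, here is a minimal 
sketch. It is not the code from the attached patch: the class name, the method 
name and the single supported date format ("18 July 2016") are assumptions made 
for this example only, and it uses java.time.LocalDate rather than Date.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or of the NUTCH-1414 patch.
final class FirstDateSketch {

    // Assumption: only English dates written like "18 July 2016" are recognised.
    private static final Pattern DATE = Pattern.compile(
        "\\b\\d{1,2} (?:January|February|March|April|May|June|July|August"
        + "|September|October|November|December) \\d{4}\\b");

    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("d MMMM uuuu", Locale.ENGLISH);

    // Returns the first parseable date in the text, scanning left to right.
    static Optional<LocalDate> firstDate(String text) {
        Matcher matcher = DATE.matcher(text);
        while (matcher.find()) {
            try {
                return Optional.of(LocalDate.parse(matcher.group(), FORMAT));
            } catch (DateTimeParseException e) {
                // Matched the pattern but is not a real date (e.g. "45 July 2016");
                // skip it and keep scanning.
            }
        }
        return Optional.empty();
    }
}
```

Given the text "Updated 18 July 2016. Originally published 3 March 2012.", 
firstDate returns 18 July 2016 even though the article date is arguably the 
second one. Plain text carries no hint about which date is the right one, which 
is the limitation described above.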

Adding it to plugin.includes and index.parse.md makes it work. Use the 
bin/nutch indexchecker command to check what output goes to the search engine.

If this doesn't work for you and you still need it, we can provide a custom 
solution that does better text and date extraction, provides language and 
cookie detection and more.



> Date extraction parse filter
> ----------------------------
>
>                 Key: NUTCH-1414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1414
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch that provides a means of extracting 
> an arbitrary page date (article date) from the parse text.



