[
https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882886#comment-13882886
]
Markus Jelsma commented on NUTCH-1414:
--------------------------------------
Hi Luke,
* We send it to Solr using
  protected SimpleDateFormat formattedDate = new SimpleDateFormat("yyyy-MM-dd'T00:00:00'Z");
  That is the format Solr/Lucene expects to get.
* Yes, I did. I separated the tool from Nutch and made some small changes; one
of the notable changes is that a date extracted from the URL takes precedence
by default. You do have to expand the regexes a bit to ignore false dates in
URLs.
* Makes sense. I limited the size to a) prevent the regular expressions from
choking on very large pages and b) ignore dates that do not represent the
article or published date. This is also the reason we're not using this anymore
but have tied it into our text extraction tool. There it knows the context of a
page so it won't yield many false positives.
* No, not likely, but the patch works so you should not have much trouble using
it. It might get committed if enough users express their interest.
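
For readers following along, here is a minimal sketch of the two ideas above: preferring a date found in the URL, and formatting it at midnight for Solr. It is not the code from the attached patch. The class name UrlDateSketch, the URL_DATE pattern and both method names are made up for illustration. One detail to be aware of: in SimpleDateFormat an unquoted Z is a pattern letter that prints an RFC 822 offset such as +0000, so this sketch quotes the Z and pins the formatter to UTC, on the assumption that Solr wants a literal trailing Z.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlDateSketch {
    // Hypothetical pattern: /yyyy/MM/dd path segments, limited to plausible
    // years, months and days so that arbitrary numeric IDs are less likely
    // to be mistaken for dates.
    private static final Pattern URL_DATE = Pattern.compile(
        "/((?:19|20)\\d{2})/(0[1-9]|1[0-2])/(0[1-9]|[12]\\d|3[01])(?:/|$)");

    // Returns the date found in the URL at midnight UTC, or null if none.
    static Date extractFromUrl(String url) {
        Matcher m = URL_DATE.matcher(url);
        if (!m.find()) {
            return null;
        }
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.setLenient(false); // reject impossible dates such as 31 February
        cal.set(Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)) - 1, // Calendar months are 0-based
                Integer.parseInt(m.group(3)));
        try {
            return cal.getTime();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Formats the day with a fixed midnight time and a literal trailing Z.
    static String toSolrDate(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T00:00:00Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        Date d = extractFromUrl("http://example.com/2014/01/27/story.html");
        System.out.println(d == null ? "no date in URL" : toSolrDate(d));
    }
}
```

A real filter would fall back to the parse text when extractFromUrl returns null, and would likely need more patterns (for example yyyy-MM-dd or yyyyMMdd inside a path segment) depending on the sites being crawled.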
> Date extraction parse filter
> ----------------------------
>
> Key: NUTCH-1414
> URL: https://issues.apache.org/jira/browse/NUTCH-1414
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch to provide means to extract an
> arbitrary page date (article date) from the parse text.