[
https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882282#comment-13882282
]
Luke commented on NUTCH-1414:
-----------------------------
Hi Markus/Others,
Firstly, let me say I like this functionality and wish it was built into a
shipped plugin - I'm surprised there isn't more interest in this. Am I missing
something? Is there a newer/better way of extracting dates from parsed text?
A couple of questions:
* I was wondering if you'd attempted to pass the extracted date to Solr (or
another indexer) as a date type, rather than as a string? If so, how have you
done it?
* Many websites now put the date in the URL (esp. WordPress), e.g. /2014/01/26/,
-20140126-, /2014/jan/26/, etc. Did you consider also searching the URL?
* In getFragment() there is this code:
{code}
// Check if we need to obtain the tail
if (text.length() > maxFragmentLength + headFragmentLength) {
tail = text.substring(text.length() - maxFragmentLength);
}
{code}
I'm not sure that this does what it's meant to.
Looking at the code above, for there to be a tail the total length essentially
has to exceed {{2 x maxFragmentLength}} (assuming {{headFragmentLength}} is
about the same as {{maxFragmentLength}}).
If {{text.length() > 2 x maxFragmentLength}}, then the fragment is essentially
of length {{2 x maxFragmentLength}}.
However, if {{maxFragmentLength < text.length() < 2 x maxFragmentLength}} then
the fragment is just the head. In this case, it would make sense to have the
whole text as the fragment. As it stands, a date in the tail may be missed for
short (but not too short) pages.
* I understand Julien's POV - that this is somewhat micro functionality,
although handling dates does seem to require quite specific code. I've seen
discussions elsewhere that suggest implementing a system such as described at
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ and whilst
this seems to be a good option it could not offer the same accuracy as this
plugin. Is there any chance that this would be promoted to a shipped plugin?
What needs to happen to make that happen?
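To make the getFragment() point above concrete, here is a minimal sketch of the
behaviour suggested there. It is not the code from the patch: the class name
and the method signature are hypothetical, and only the names
{{headFragmentLength}} and {{maxFragmentLength}} come from the snippet quoted
above. The only change is that text too short to have a separate head and tail
is returned whole, so a date near the end is not dropped.
{code:java}
public class FragmentSketch {

    /**
     * Hypothetical stand-in for getFragment(): returns the part of the text
     * that would be searched for a date.
     */
    static String getFragment(String text, int headFragmentLength,
                              int maxFragmentLength) {
        // Head and tail would touch or overlap, so keep the whole text.
        if (text.length() <= headFragmentLength + maxFragmentLength) {
            return text;
        }
        // Long text: keep the head and the tail, skip the middle.
        String head = text.substring(0, headFragmentLength);
        String tail = text.substring(text.length() - maxFragmentLength);
        return head + " " + tail;
    }

    public static void main(String[] args) {
        // 15 characters with limits of 10 + 10: returned whole.
        System.out.println(getFragment("0123456789ABCDE", 10, 10));
        // 25 characters with limits of 10 + 10: head and tail only.
        System.out.println(getFragment("0123456789abcdeKLMNOPQRST", 10, 10));
    }
}
{code}
With this shape, a page between one and two fragment lengths long is searched
in full, and only longer pages have their middle section skipped.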
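The URL question above could be explored with something like the following
sketch. It is an illustration only and not part of the patch: the class, the
method names and the regular expressions are all assumptions, it covers just
the three URL shapes mentioned above, and it assumes Java 8+ for {{java.time}}.
{code:java}
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlDateSketch {

    private static final List<String> MONTHS = Arrays.asList(
        "jan", "feb", "mar", "apr", "may", "jun",
        "jul", "aug", "sep", "oct", "nov", "dec");

    // Matches /2014/01/26/ and /2014/jan/26/
    private static final Pattern PATH_DATE = Pattern.compile(
        "/(\\d{4})/(\\d{1,2}|[A-Za-z]{3})/(\\d{1,2})(?=/|$)");

    // Matches an isolated run of eight digits, as in -20140126-
    private static final Pattern COMPACT_DATE = Pattern.compile(
        "(?<!\\d)(\\d{4})(\\d{2})(\\d{2})(?!\\d)");

    /** Returns the first valid calendar date found in the URL, if any. */
    static Optional<LocalDate> extractDate(String url) {
        for (Pattern pattern : Arrays.asList(PATH_DATE, COMPACT_DATE)) {
            Matcher m = pattern.matcher(url);
            while (m.find()) {
                Optional<LocalDate> date =
                    toDate(m.group(1), m.group(2), m.group(3));
                if (date.isPresent()) {
                    return date;
                }
            }
        }
        return Optional.empty();
    }

    private static Optional<LocalDate> toDate(String year, String month,
                                              String day) {
        // Month is either numeric or a three-letter English abbreviation;
        // an unknown abbreviation maps to 0 and is rejected below.
        int monthNumber = Character.isDigit(month.charAt(0))
            ? Integer.parseInt(month)
            : MONTHS.indexOf(month.toLowerCase(Locale.ROOT)) + 1;
        try {
            return Optional.of(LocalDate.of(
                Integer.parseInt(year), monthNumber, Integer.parseInt(day)));
        } catch (DateTimeException e) {
            // Not a real calendar date (e.g. month 13), so ignore the match.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(extractDate("http://example.com/2014/01/26/post"));
        System.out.println(extractDate("http://example.com/a-20140126-b.html"));
        System.out.println(extractDate("http://example.com/2014/jan/26/"));
        System.out.println(extractDate("http://example.com/about"));
    }
}
{code}
A URL date found this way would probably be best treated as a fallback or a
cross-check against the date extracted from the parse text, since eight-digit
runs in URLs are often identifiers rather than dates.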
> Date extraction parse filter
> ----------------------------
>
> Key: NUTCH-1414
> URL: https://issues.apache.org/jira/browse/NUTCH-1414
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch to provide means to extract an
> arbitrary page date (article date) from the parse text.