[jira] [Comment Edited] (NUTCH-1414) Date extraction parse filter

Luke (JIRA) Sun, 26 Jan 2014 04:12:46 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882282#comment-13882282
 ]


Luke edited comment on NUTCH-1414 at 1/26/14 12:10 PM:
-------------------------------------------------------

Hi Markus/Others,

Firstly, let me say I like this functionality and wish it was built into a 
shipped plugin - I'm surprised there isn't more interest in this. Am I missing 
something? Is there a newer/better way of extracting dates from parsed text?

A couple of questions:
* Have you tried passing the extracted date to Solr (or another indexing 
backend) as a date type, rather than as a string? If so, how did you do it?
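
For context on the question above: Solr date fields expect UTC timestamps of the form 1995-12-31T23:59:59Z. The following is only a sketch of the string conversion, assuming the extracted date is already normalised to yyyy-MM-dd; the class and method names are hypothetical and not part of the patch, and java.time needs Java 8 or later.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class SolrDateSketch {
    // Hypothetical helper, not from the NUTCH-1414 patch.
    // Assumes the input looks like "2014-01-26" and treats it as midnight UTC.
    static String toSolrDate(String extracted) {
        LocalDate day = LocalDate.parse(extracted, DateTimeFormatter.ISO_LOCAL_DATE);
        // ISO_INSTANT renders the instant in UTC with a trailing 'Z'.
        return DateTimeFormatter.ISO_INSTANT.format(
            day.atStartOfDay(ZoneOffset.UTC).toInstant());
    }
}
```

How the resulting value is attached to the indexed document is left open here, since that depends on the indexing filter and schema in use.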

* Many websites now put the date in the URL (esp. WordPress), e.g. 
/2014/01/26/, \-20140126\-, /2014/jan/26/. Did you consider also searching the 
URL?
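
To make the URL idea concrete, here is a rough sketch that recognises only the three shapes listed above. It is not taken from the patch: the class name, method name and regular expressions are assumptions for illustration, and real URLs would need more patterns and sanity checks (for example, rejecting dates in the future).

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlDateSketch {
    private static final List<String> MONTHS = Arrays.asList(
        "jan", "feb", "mar", "apr", "may", "jun",
        "jul", "aug", "sep", "oct", "nov", "dec");

    // Matches /2014/01/26/ and /2014/jan/26/ (trailing slash or end of URL).
    private static final Pattern SLASHED = Pattern.compile(
        "/(\\d{4})/(\\d{2}|[a-z]{3})/(\\d{2})(?:/|$)", Pattern.CASE_INSENSITIVE);
    // Matches -20140126-
    private static final Pattern COMPACT = Pattern.compile("-(\\d{4})(\\d{2})(\\d{2})-");

    // Hypothetical helper: returns the first date found in the URL, if any.
    static Optional<LocalDate> fromUrl(String url) {
        Matcher m = SLASHED.matcher(url);
        if (!m.find()) {
            m = COMPACT.matcher(url);
            if (!m.find()) {
                return Optional.empty();
            }
        }
        String month = m.group(2).toLowerCase(Locale.ROOT);
        // Numeric month is parsed directly; a name is looked up (unknown names give 0).
        int monthNumber = Character.isDigit(month.charAt(0))
            ? Integer.parseInt(month)
            : MONTHS.indexOf(month) + 1;
        try {
            return Optional.of(LocalDate.of(
                Integer.parseInt(m.group(1)), monthNumber, Integer.parseInt(m.group(3))));
        } catch (DateTimeException e) {
            // Digits that do not form a valid calendar date, e.g. month 13 or 0.
            return Optional.empty();
        }
    }
}
```

A date found this way could serve as a fallback, or as a cross-check against whatever is extracted from the parse text.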

* In getFragment() there is this code:
{code}
     // Check if we need to obtain the tail
     if (text.length() > maxFragmentLength + headFragmentLength) {
       tail = text.substring(text.length() - maxFragmentLength);
     }
{code}
I'm not sure that this does what it's meant to.
Looking at the code above, and taking {{headFragmentLength}} to be about the 
same as {{maxFragmentLength}}, there is only a tail when the total length 
exceeds {{2 x maxFragmentLength}}. That gives three cases:
If {{text.length() > 2 x maxFragmentLength}}, then the fragment is essentially 
of length {{2 x maxFragmentLength}} (head plus tail).
If {{text.length() <= maxFragmentLength}}, then {{fragment == text}}.
However, if {{maxFragmentLength < text.length() <= 2 x maxFragmentLength}}, then 
the fragment is just the head, even though the whole text would fit. In this 
case it would make sense to use the whole text as the fragment. As it stands, 
a date in the tail may be missed for short (but not too short) pages.
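
A minimal sketch of one way to handle that middle case, not the patch's actual code: the variable names mirror the snippet above, but the method signature, the way the head is taken and the way head and tail are joined are all assumptions.

```java
class FragmentSketch {
    // Hypothetical stand-in for getFragment(); only the length handling is of interest.
    static String getFragment(String text, int headFragmentLength, int maxFragmentLength) {
        // If head and tail together would cover the whole text, keep all of it,
        // so a date near the end of a shortish page is not dropped.
        if (text.length() <= headFragmentLength + maxFragmentLength) {
            return text;
        }
        String head = text.substring(0, headFragmentLength);
        String tail = text.substring(text.length() - maxFragmentLength);
        // Assumption: head and tail are joined with a single space.
        return head + " " + tail;
    }
}
```

With both lengths set to 4, the text "abcdef" comes back whole, while "abcdefghij" is reduced to "abcd ghij".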

* I understand Julien's POV - that this is somewhat micro functionality - 
although handling dates does seem to require quite specific code. I've seen 
discussions elsewhere that suggest implementing a system such as the one 
described at http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ 
and, whilst that seems to be a good option, it could not offer the same 
accuracy as this plugin. Is there any chance that this would be promoted to a 
shipped plugin? What would need to happen for that?



> Date extraction parse filter
> ----------------------------
>
>                 Key: NUTCH-1414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1414
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch to provide means to extract an 
> arbitrary page date (article date) from the parse text.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
