[jira] [Updated] (NUTCH-1060) URL filters to produce regexes to be used by OutlinkExtractor.

Markus Jelsma (Updated) (JIRA) Tue, 03 Apr 2012 05:09:43 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Markus Jelsma updated NUTCH-1060:
---------------------------------

    Fix Version/s:     (was: 1.5)
                   1.6

20120304-push-1.6
                
> URL filters to produce regexes to be used by OutlinkExtractor.
> --------------------------------------------------------------
>
>                 Key: NUTCH-1060
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1060
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>             Fix For: 1.6
>
>
> The problem:
> OutlinkExtractor produces many URLs from plain text using an advanced 
> regular expression:
> {code}
> ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
> {code}
> This expression does not take into account the various non-regex-based URL 
> filters, such as prefix, domain and suffix, and thus produces URLs that are 
> going to be filtered out by some filter later on. This becomes a problem 
> when parsing millions of documents that are processed by the 
> OutlinkExtractor (that is, when parse-html or parse-tika does not produce any 
> outlinks). Large bodies of full text usually contain many sequences that 
> are extracted as URLs, often because a leading word is mistaken for a URI 
> scheme, such as:
> id:123
> says:what
> user:doe
> update:tue-19-jul
> The above examples can easily be remedied by configuring a prefix URL 
> filter. It may, however, be an even better idea to prevent the extraction of 
> these URLs in the first place. No extraction means fewer URLs to filter and 
> potentially saves a lot of data.
> Comments? I'll see if I can produce a patch.
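
To illustrate the idea in the description above, here is a minimal sketch, not the actual Nutch patch. It runs the quoted expression over the example text, then runs a narrower expression built from a list of allowed prefixes. The class name OutlinkPrefixSketch, the helper fromPrefixes and the prefix list are hypothetical and chosen only for this example; how a real URL filter would expose its prefixes is not specified in the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutlinkPrefixSketch {

    // Extraction pattern as quoted in the issue description.
    static final Pattern GENERIC = Pattern.compile(
        "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]"
        + "(([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}"
        + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");

    // Hypothetical helper: replaces the generic scheme part with an
    // alternation of literal prefixes, as a prefix filter might supply them.
    static Pattern fromPrefixes(List<String> prefixes) {
        List<String> quoted = new ArrayList<>();
        for (String prefix : prefixes) {
            quoted.add(Pattern.quote(prefix)); // treat each prefix literally
        }
        return Pattern.compile(
            "(?:" + String.join("|", quoted) + ")"
            + "(?:[A-Za-z0-9$_.+!*,;/?:@&~=-]|%[A-Fa-f0-9]{2}){1,333}"
            + "(?:#[a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000})?");
    }

    // Collects every non-overlapping match of the pattern in the text.
    static List<String> extract(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text =
            "id:123 says:what user:doe update:tue-19-jul http://example.com/a";

        // The generic pattern extracts all five tokens.
        System.out.println(extract(GENERIC, text));

        // Assumed prefix configuration; only the real URL is extracted.
        Pattern narrowed = fromPrefixes(Arrays.asList("http://", "https://"));
        System.out.println(extract(narrowed, text));
    }
}
```

With the assumed prefixes, the four false positives are never extracted, so no later filter has to reject them.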

