URL filters to produce regexes to be used by OutlinkExtractor.
--------------------------------------------------------------

                 Key: NUTCH-1060
                 URL: https://issues.apache.org/jira/browse/NUTCH-1060
             Project: Nutch
          Issue Type: New Feature
            Reporter: Markus Jelsma
             Fix For: 1.4, 2.0


The problem:

OutlinkExtractor extracts many URLs from plain text using an elaborate regular 
expression:

{code}
([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
{code}

This expression does not take into account the various non-regex-based URL 
filters, such as prefix, domain and suffix, and thus produces URLs that are 
going to be filtered out by some filter anyway. This becomes a problem when 
parsing millions of documents that are processed by the OutlinkExtractor 
(in case parse-html or parse-tika does not produce any outlinks). Large bodies 
of full text usually contain a lot of sequences that are extracted as URLs 
because their first part is mistaken for a URI scheme, such as:

id:123
says:what
user:doe
update:tue-19-jul
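
To make this concrete, here is a minimal Java sketch that runs the expression 
quoted above over a line of plain text. The class name OutlinkRegexDemo, the 
method name and the sample text are made up for illustration; this is not the 
actual OutlinkExtractor code, only the same pattern applied with 
java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class OutlinkRegexDemo {

    // The expression quoted in this issue, copied verbatim.
    static final Pattern URL_PATTERN = Pattern.compile(
        "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]"
        + "(([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}"
        + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");

    // Collect every substring of the text that the expression accepts.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints [id:123, says:what, user:doe, update:tue-19-jul]
        System.out.println(extract("id:123 says:what user:doe update:tue-19-jul"));
    }
}
```

Every one of the four sequences is returned as a candidate URL, because 
anything of the form letters, colon, more characters satisfies the scheme part 
of the expression.
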

The above examples can easily be remedied by configuring a prefix URL 
filter. It may, however, be an even better idea to prevent the extraction of 
these URLs in the first place. No extraction means fewer URLs to filter and 
potentially saves a lot of data.
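
One way the idea could look is sketched below in Java, under stated 
assumptions: a prefix filter hands its configured prefixes to the extractor as 
a regex, and candidates that do not start with one of them are dropped before 
they are ever emitted as outlinks. The class PrefixGuardSketch, its methods 
and the prefix list are hypothetical and do not correspond to any existing 
Nutch API; no patch exists yet, so this only illustrates the direction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch, not part of Nutch.
class PrefixGuardSketch {

    // The extraction expression quoted in this issue, copied verbatim.
    static final Pattern URL_PATTERN = Pattern.compile(
        "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]"
        + "(([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}"
        + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");

    // Assumption: a prefix filter can expose its accepted prefixes as a list.
    // Each prefix is quoted so that it is matched literally.
    static Pattern prefixGuard(List<String> prefixes) {
        String alternation = prefixes.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        return Pattern.compile("(?:" + alternation + ")");
    }

    // Keep only candidates that begin with one of the accepted prefixes.
    static List<String> extract(String text, Pattern guard) {
        List<String> kept = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            String candidate = matcher.group();
            if (guard.matcher(candidate).lookingAt()) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Pattern guard = prefixGuard(List.of("http://", "https://"));
        // Prints [http://example.com/page]
        System.out.println(
            extract("id:123 says:what see http://example.com/page user:doe", guard));
    }
}
```

Here id:123, says:what and user:doe are discarded inside the extraction loop, 
so they never reach the URL filters. Domain and suffix filters would need 
their own regex fragments, which this sketch does not attempt.
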

Comments? I'll see if I can produce a patch.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
