URL filters to produce regexes to be used by OutlinkExtractor.
--------------------------------------------------------------
Key: NUTCH-1060
URL: https://issues.apache.org/jira/browse/NUTCH-1060
Project: Nutch
Issue Type: New Feature
Reporter: Markus Jelsma
Fix For: 1.4, 2.0
The problem:
OutlinkExtractor produces many URLs from plain text using an advanced regular
expression:
{code}
([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
{code}
This expression does not take into account the various non-regex-based URL
filters such as prefix, domain and suffix, and thus produces URLs that some
filter will later discard. This becomes a problem when millions of documents
are processed by the OutlinkExtractor (that is, when parse-html or parse-tika
does not produce any outlinks). Large bodies of full text usually contain many
sequences that are extracted as URLs because the part before the colon is
taken to be a URI scheme, such as:
id:123
says:what
user:doe
update:tue-19-jul
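To make the effect concrete, here is a minimal sketch that runs the expression quoted above over a line of text containing these tokens plus one real URL. The class name OutlinkRegexDemo and the helper extract are made up for illustration and are not part of Nutch; only the pattern itself is taken from this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not Nutch code.
class OutlinkRegexDemo {
  // The expression from this issue, copied verbatim and split over two
  // string literals only for line length.
  static final Pattern URL_PATTERN = Pattern.compile(
      "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]"
      + "(([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}"
      + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");

  // Returns every substring of the text that the pattern accepts as a URL.
  static List<String> extract(String text) {
    List<String> found = new ArrayList<String>();
    Matcher m = URL_PATTERN.matcher(text);
    while (m.find()) {
      found.add(m.group());
    }
    return found;
  }

  public static void main(String[] args) {
    String text =
        "id:123 says:what user:doe update:tue-19-jul http://nutch.apache.org/";
    // All five tokens are reported, not just the real URL.
    for (String candidate : extract(text)) {
      System.out.println(candidate);
    }
  }
}
```

Anything of the form "word, colon, more characters" satisfies the scheme part of the pattern, which is why ordinary prose yields so many candidates.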
The above examples can easily be remedied by configuring a prefix URL
filter. It may, however, be an even better idea to prevent the extraction of
these URLs in the first place. No extraction means fewer URLs to filter and
potentially saves a lot of data.
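One way this could look, purely as a sketch and not the eventual patch: a prefix filter hands over its configured prefixes, and the extractor compiles them into the front of its expression so that only text starting with an accepted prefix is matched at all. The class PrefixAwareExtractor, the method buildPattern and the idea of passing the prefixes in as a list are assumptions made for this example; the character classes for the URL body and fragment are reused from the expression above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch code.
class PrefixAwareExtractor {
  // URL body and optional fragment, reused from the existing expression.
  static final String BODY =
      "(?:[A-Za-z0-9$_.+!*,;/?:@&~=-]|%[A-Fa-f0-9]{2}){1,333}";
  static final String FRAGMENT =
      "(?:#[a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000})?";

  // Builds an expression that only matches text starting with one of the
  // given prefixes (assumed to come from the prefix URL filter's config).
  static Pattern buildPattern(List<String> prefixes) {
    if (prefixes.isEmpty()) {
      throw new IllegalArgumentException("at least one prefix is required");
    }
    StringBuilder alternatives = new StringBuilder();
    for (String prefix : prefixes) {
      if (alternatives.length() > 0) {
        alternatives.append('|');
      }
      // Quote the prefix so characters such as '.' are matched literally.
      alternatives.append(Pattern.quote(prefix));
    }
    return Pattern.compile("(?:" + alternatives + ")" + BODY + FRAGMENT);
  }

  static List<String> extract(Pattern pattern, String text) {
    List<String> found = new ArrayList<String>();
    Matcher m = pattern.matcher(text);
    while (m.find()) {
      found.add(m.group());
    }
    return found;
  }

  public static void main(String[] args) {
    Pattern p = buildPattern(Arrays.asList("http://", "https://"));
    String text = "id:123 says:what see http://nutch.apache.org/ "
        + "and https://example.org/a#top";
    // Only the two URLs with an accepted prefix are extracted.
    for (String url : extract(p, text)) {
      System.out.println(url);
    }
  }
}
```

The sketch covers the prefix case only; domain and suffix filters would need their own contribution to the expression, and a prefix appearing in the middle of a longer token would still match from that point on.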
Comments? I'll see if I can produce a patch.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira