[ https://issues.apache.org/jira/browse/NUTCH-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1060:
---------------------------------

    Fix Version/s:     (was: 1.5)
                       1.6

20120304-push-1.6

> URL filters to produce regexes to be used by OutlinkExtractor.
> --------------------------------------------------------------
>
>                 Key: NUTCH-1060
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1060
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>             Fix For: 1.6
>
>
> The problem:
> OutlinkExtractor produces many URLs from plain text using an advanced
> regular expression:
> {code}
> ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)
> {code}
> This expression does not take into account the various non-regex-based URL
> filters such as prefix, domain and suffix, and thus produces URLs that are
> going to be filtered out by one of those filters anyway. This becomes a
> problem when millions of documents are processed by the OutlinkExtractor
> (which happens when parse-html|parse-tika does not produce any outlinks).
> Large bodies of full text usually contain many sequences that are extracted
> as URLs because the part before the colon is mistaken for a URI scheme,
> such as:
> id:123
> says:what
> user:doe
> update:tue-19-jul
> The above examples can easily be remedied by using a configured prefix URL
> filter. It may, however, be an even better idea to prevent the extraction of
> these URLs in the first place. No extraction means fewer URLs to filter and
> potentially a lot less data to handle.
> Comments? I'll see if I can produce a patch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
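
To make the proposal in the issue concrete, here is a minimal sketch of the idea in Python (not Nutch's Java code, and not the eventual patch). It assumes a prefix URL filter can hand over a set of allowed schemes, and it swaps the generic scheme part of the quoted expression for an alternation of those schemes. The names `build_outlink_pattern` and `extract_outlinks`, the sample text, and the added lookbehind are all hypothetical and only serve the illustration.

```python
import re

# Everything after the scheme, copied from the expression quoted in the issue.
URL_TAIL = (
    r":[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}"
    r"(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?"
)

# Scheme part of the quoted expression: accepts any scheme-like token.
GENERIC_SCHEME = r"[A-Za-z][A-Za-z0-9+.-]{1,120}"


def build_outlink_pattern(allowed_schemes=None):
    """Return a compiled pattern; restrict the scheme if a set is given.

    allowed_schemes is an assumption: a set of schemes that a prefix URL
    filter would accept, e.g. {"http", "https"}.
    """
    if allowed_schemes:
        alternatives = "|".join(re.escape(s) for s in sorted(allowed_schemes))
        # The lookbehind (an addition, not in the issue) stops "http" from
        # matching inside a longer token such as "xhttp".
        scheme = r"(?<![A-Za-z0-9+.-])(?:" + alternatives + ")"
    else:
        scheme = GENERIC_SCHEME
    return re.compile("(" + scheme + URL_TAIL + ")")


def extract_outlinks(text, pattern):
    """Return every substring of text that the pattern accepts as a URL."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "id:123 says:what user:doe update:tue-19-jul see http://example.com/page"

# Generic expression: all four false positives are extracted as well.
generic = extract_outlinks(sample, build_outlink_pattern())

# Scheme-restricted expression: only the real URL is extracted.
restricted = extract_outlinks(sample, build_outlink_pattern({"http", "https"}))
```

With the sample text above, `generic` holds five entries (the four scheme-like tokens plus the real URL), while `restricted` holds only `http://example.com/page`, so the false positives never reach the URL filters at all.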