[jira] [Updated] (NUTCH-1060) URL filters to produce regexes to be used by OutlinkExtractor.
[ https://issues.apache.org/jira/browse/NUTCH-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1060:
---------------------------------

    Fix Version/s:     (was: 1.5)
                       1.6

20120304-push-1.6

URL filters to produce regexes to be used by OutlinkExtractor.
--------------------------------------------------------------

                Key: NUTCH-1060
                URL: https://issues.apache.org/jira/browse/NUTCH-1060
            Project: Nutch
         Issue Type: New Feature
           Reporter: Markus Jelsma
            Fix For: 1.6

The problem: OutlinkExtractor produces many URLs from plain text using an advanced regular expression:

{code}
([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?)
{code}

This expression does not take into account the various non-regex-based URL filters, such as prefix, domain and suffix, and thus produces URLs that some filter will later reject.

This becomes a problem when millions of documents are processed by the OutlinkExtractor (that is, when parse-html or parse-tika does not produce any outlinks). Large bodies of full text usually contain many sequences that are extracted as URLs. Many of them are matched only because the text before the colon looks like a URI scheme, for example:

id:123
says:what
user:doe
update:tue-19-jul

The above examples can easily be remedied by using a configured prefix URL filter. It may, however, be an even better idea to prevent the extraction of these URLs in the first place. No extraction means fewer URLs to filter and potentially a lot less data to handle.

Comments? I'll see if I can produce a patch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
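The idea in the issue can be illustrated with a small, self-contained sketch. It is not Nutch code and not the eventual patch: the class name OutlinkRegexSketch, the helper methods fromPrefixes and extract, the TAIL constant and the sample text are all assumptions made for illustration. Only the GENERIC expression is taken from the issue. The sketch first runs that expression over plain text, where it also picks up the pseudo-URLs listed in the description, and then builds a narrower expression from a hypothetical list of allowed prefixes so that those tokens are never extracted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustration only: hypothetical class, not part of Nutch.
class OutlinkRegexSketch {

  // The generic expression quoted in the issue description.
  static final String GENERIC =
      "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]"
      + "(([A-Za-z0-9$_.+!*,;/?:@~=-])|%[A-Fa-f0-9]{2}){1,333}"
      + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?)";

  // Assumption: the part of a URL after the prefix reuses the character
  // classes of the generic expression (path characters, optional fragment).
  static final String TAIL =
      "(?:[A-Za-z0-9$_.+!*,;/?:@~=-]|%[A-Fa-f0-9]{2}){1,333}"
      + "(?:#[a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000})?";

  // Builds an extraction pattern that only accepts the given literal prefixes.
  // This stands in for "a URL filter producing a regex"; the real design
  // may differ.
  static Pattern fromPrefixes(List<String> prefixes) {
    String alternation = prefixes.stream()
        .map(Pattern::quote)               // treat each prefix as a literal
        .collect(Collectors.joining("|"));
    return Pattern.compile("(?:" + alternation + ")" + TAIL);
  }

  // Returns every non-overlapping match of the pattern in the text.
  static List<String> extract(Pattern pattern, String text) {
    List<String> found = new ArrayList<>();
    Matcher m = pattern.matcher(text);
    while (m.find()) {
      found.add(m.group());
    }
    return found;
  }

  public static void main(String[] args) {
    String text =
        "id:123 says:what user:doe update:tue-19-jul http://example.org/page";

    // The generic expression extracts all five tokens, four of them bogus.
    System.out.println(extract(Pattern.compile(GENERIC), text));

    // The prefix-derived expression extracts only the real URL.
    System.out.println(extract(fromPrefixes(List.of("http://", "https://")), text));
  }
}
```

In this sketch the filtering decision moves into the extraction step, so the four bogus tokens are never turned into outlinks and never reach a URL filter. Whether the domain and suffix filters can be expressed as cleanly in a regular expression is left open here.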
[jira] [Updated] (NUTCH-1060) URL filters to produce regexes to be used by OutlinkExtractor.
[ https://issues.apache.org/jira/browse/NUTCH-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1060:
---------------------------------

    Fix Version/s:     (was: 1.4)
                       (was: nutchgora)
                       1.5

URL filters to produce regexes to be used by OutlinkExtractor.
--------------------------------------------------------------

                Key: NUTCH-1060
                URL: https://issues.apache.org/jira/browse/NUTCH-1060
            Project: Nutch
         Issue Type: New Feature
           Reporter: Markus Jelsma
            Fix For: 1.5