[jira] [Updated] (NUTCH-1060) URL filters to produce regexes to be used by OutlinkExtractor.

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1060:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 URL filters to produce regexes to be used by OutlinkExtractor.
 --

 Key: NUTCH-1060
 URL: https://issues.apache.org/jira/browse/NUTCH-1060
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
 Fix For: 1.6


 The problem:
 OutlinkExtractor produces many URLs from plain text using an advanced 
 regular expression:
 {code}
 ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?)
 {code}
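 Broken down informally, the expression accepts any word-like token followed 
 by a colon and at least two further characters; the scheme part is not 
 restricted to http, https, ftp and the like:
 {code}
 [A-Za-z][A-Za-z0-9+.-]{1,120}                          scheme-like token: a letter plus 1-120 letters, digits, '+', '.' or '-'
 :[A-Za-z0-9/]                                          a colon and the first character after it
 (([A-Za-z0-9$_.+!*,;/?:@~=-])|%[A-Fa-f0-9]{2}){1,333}  1-333 further characters or %XX escapes
 (#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?   an optional #fragment
 {code}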
 This expression does not take into account the various non-regex-based URL 
 filters such as prefix, domain and suffix, and thus produces URLs that are 
 going to be filtered out by some filter anyway. This becomes a problem when 
 parsing millions of documents that are processed by the OutlinkExtractor 
 (in case parse-html|parse-tika do not produce any outlinks). Large bodies of 
 full text usually contain a lot of sequences that are extracted as URLs, 
 many of which look as if they carry a URI scheme, such as (see the short 
 demonstration after this list):
 id:123
 says:what
 user:doe
 update:tue-19-jul
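 For illustration, feeding a snippet of text containing such tokens through 
 the expression extracts all of them as "URLs". This uses java.util.regex 
 purely for demonstration (OutlinkExtractor itself may use a different regex 
 engine); the class name and sample text are made up:
 {code}
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 // Demonstration only: applies the expression quoted above to a snippet of
 // plain text and prints everything it extracts as a "URL".
 public class OutlinkRegexDemo {

   private static final Pattern URL_PATTERN = Pattern.compile(
       "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@~=-])"
       + "|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?)");

   public static void main(String[] args) {
     String text = "See http://nutch.apache.org/ for details "
         + "Ticket id:123 says:what by user:doe update:tue-19-jul";
     Matcher m = URL_PATTERN.matcher(text);
     while (m.find()) {
       // Prints the genuine URL but also id:123, says:what, user:doe and update:tue-19-jul
       System.out.println(m.group());
     }
   }
 }
 {code}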
 Examples like these can easily be remedied with a configured prefix URL 
 filter. It may, however, be an even better idea to prevent the extraction of 
 these URLs in the first place: no extraction means filtering fewer URLs and 
 potentially saving a lot of data.
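 For reference, the prefix filter remedy mentioned above amounts to an entry 
 file roughly like the following (a minimal sketch; the urlfilter-prefix 
 plugin must be enabled, and the exact file name and entries depend on the 
 configuration):
 {code}
 # prefix-urlfilter.txt (example): only URLs starting with one of these
 # prefixes pass; tokens such as id:123 or says:what are rejected.
 http://
 https://
 ftp://
 file://
 {code}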
 Comments? I'll see if I can produce a patch.
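 Purely as a sketch of the direction such a patch could take (the interface 
 and method names below are hypothetical and do not exist in Nutch):
 {code}
 // Hypothetical sketch only: a filter exposes the regex fragment describing
 // URLs it would accept, and OutlinkExtractor joins those fragments with '|'
 // in place of its generic scheme part, so strings like id:123 are never
 // extracted at all.
 public interface RegexProducingURLFilter {

   /** Regex fragment for URLs this filter accepts, e.g. "https?://" for a prefix filter. */
   String acceptPattern();
 }
 {code}
 A prefix filter could derive its fragment directly from its configured 
 prefixes, a domain filter from its domain list, and so on.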

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1060) URL filters to produce regexes to be used by OutlinkExtractor.

2011-09-29 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1060:
-

Fix Version/s: (was: 1.4)
   (was: nutchgora)
   1.5
