[ https://issues.apache.org/jira/browse/NUTCH-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-364.
-------------------------------

    Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> Javascript parser creates some fairly bogus URLs
> ------------------------------------------------
>
>                 Key: NUTCH-364
>                 URL: https://issues.apache.org/jira/browse/NUTCH-364
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: OS X 10.4.7
>            Reporter: Doug Cook
>
> If one crawls, say, 
>      http://www.metropoleparis.com/2000/501/
> with the Javascript parser enabled, one gets outlinks of the form:
> 2006-09-08 16:55:06,301 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.metropoleparis.com/2000/501/</IFRAME>'
> 2006-09-08 16:55:06,302 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.metropoleparis.com/2000/501/</SCR'
> 2006-09-08 16:55:06,302 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.metropoleparis.com/2000/501/</DIV>'
> Another example would be:
> http://www.wein-plus.de/glossar/G.htm
> which yields the URL (among others):
> 2006-09-08 16:55:10,499 DEBUG js.JSParseFilter -  - outlink from JS: 'http://www.wein-plus.de/glossar/<\/a>'
> I have seen these form "crawler traps" and make small sites explode to many, 
> many URLs. For the moment, I have the worst offenders plugged with specific 
> filter rules, but it would be nice to see if there is a way to improve the 
> JSParseFilter's heuristics to avoid these.
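
For illustration only, the kind of heuristic the report asks for might look
like the sketch below. It is not code from Nutch or from JSParseFilter: the
class name OutlinkSanityCheck, the method isPlausible, and the chosen
character set are all assumptions made for this example. The idea is simply
that a raw '<', '>', backslash, double quote, or whitespace character should
not appear unencoded in a URL, so a candidate outlink containing one is
probably an HTML fragment lifted out of a JavaScript string.

```java
import java.util.List;
import java.util.regex.Pattern;

/**
 * Hypothetical helper, not part of Nutch: flags candidate outlinks that
 * contain leftover markup characters.
 */
public class OutlinkSanityCheck {

    // Assumption: these characters would be percent-encoded in a genuine URL,
    // so their raw presence suggests an HTML fragment rather than a link.
    private static final Pattern MARKUP_DEBRIS =
            Pattern.compile("[<>\\\\\"\\s]");

    /** Returns true if the candidate contains none of the debris characters. */
    public static boolean isPlausible(String url) {
        return url != null
                && !url.isEmpty()
                && !MARKUP_DEBRIS.matcher(url).find();
    }

    public static void main(String[] args) {
        // Candidates taken from the log lines in the report, plus the
        // crawled page itself as a well-formed control.
        List<String> candidates = List.of(
                "http://www.metropoleparis.com/2000/501/</IFRAME>",
                "http://www.metropoleparis.com/2000/501/</SCR",
                "http://www.wein-plus.de/glossar/<\\/a>",
                "http://www.metropoleparis.com/2000/501/");
        for (String url : candidates) {
            System.out.println((isPlausible(url) ? "keep " : "drop ") + url);
        }
    }
}
```

A check like this only catches debris that contains one of the listed
characters; a truncated but otherwise clean string would still pass, so it
would complement site-specific filter rules rather than replace them.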

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
