Javascript parser creates some fairly bogus URLs
------------------------------------------------

                 Key: NUTCH-364
                 URL: http://issues.apache.org/jira/browse/NUTCH-364
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8
         Environment: OS X 10.4.7
            Reporter: Doug Cook


If one crawls, say, 
     http://www.metropoleparis.com/2000/501/

with the Javascript parser enabled, one gets outlinks of the form:
2006-09-08 16:55:06,301 DEBUG js.JSParseFilter -  - outlink from JS: 
'http://www.metropoleparis.com/2000/501/</IFRAME>'
2006-09-08 16:55:06,302 DEBUG js.JSParseFilter -  - outlink from JS: 
'http://www.metropoleparis.com/2000/501/</SCR'
2006-09-08 16:55:06,302 DEBUG js.JSParseFilter -  - outlink from JS: 
'http://www.metropoleparis.com/2000/501/</DIV>'


Another example would be:
http://www.wein-plus.de/glossar/G.htm

which yields the URL (among others):
2006-09-08 16:55:10,499 DEBUG js.JSParseFilter -  - outlink from JS: 
'http://www.wein-plus.de/glossar/<\/a>'

I have seen these form "crawler traps" that make small sites explode into 
many, many URLs. For the moment, I have plugged the worst offenders with 
specific filter rules, but it would be nice to see whether the 
JSParseFilter's heuristics can be improved to avoid these.
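
As a rough illustration of the kind of heuristic that might help, here is a 
minimal sketch that drops candidate outlinks containing characters typical of 
HTML fragments ('<', '>', backslash, double quote, whitespace). The class and 
method names (JsOutlinkHeuristic, looksLikeMarkup, keepPlausible) are 
hypothetical and not part of Nutch. It assumes the check would run on the 
resolved URL strings before they are turned into outlinks, and it is not the 
JSParseFilter's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: a sketch of one possible heuristic.
final class JsOutlinkHeuristic {

    private JsOutlinkHeuristic() {}

    // Returns true if the candidate contains characters that do not normally
    // appear unescaped in a URL but are common in HTML fragments such as
    // "</DIV>" or "<\/a>".
    static boolean looksLikeMarkup(String candidate) {
        for (int i = 0; i < candidate.length(); i++) {
            char c = candidate.charAt(i);
            if (c == '<' || c == '>' || c == '\\' || c == '"'
                    || Character.isWhitespace(c)) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the candidates that do not look like markup, preserving order.
    static List<String> keepPlausible(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (!looksLikeMarkup(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

Applied to the examples above, keepPlausible would discard 
'http://www.metropoleparis.com/2000/501/</IFRAME>' and 
'http://www.wein-plus.de/glossar/<\/a>' while keeping 
'http://www.wein-plus.de/glossar/G.htm'. A character blacklist like this is 
deliberately conservative. It would not catch bogus links made only of 
URL-safe characters.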


        
