Javascript parser creates some fairly bogus URLs
------------------------------------------------
Key: NUTCH-364
URL: http://issues.apache.org/jira/browse/NUTCH-364
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Environment: OS X 10.4.7
Reporter: Doug Cook
If one crawls, say,
http://www.metropoleparis.com/2000/501/
with the JavaScript parser enabled, one gets outlinks of the form:
2006-09-08 16:55:06,301 DEBUG js.JSParseFilter - - outlink from JS:
'http://www.metropoleparis.com/2000/501/</IFRAME>'
2006-09-08 16:55:06,302 DEBUG js.JSParseFilter - - outlink from JS:
'http://www.metropoleparis.com/2000/501/</SCR'
2006-09-08 16:55:06,302 DEBUG js.JSParseFilter - - outlink from JS:
'http://www.metropoleparis.com/2000/501/</DIV>'
Another example would be:
http://www.wein-plus.de/glossar/G.htm
which yields the URL (among others):
2006-09-08 16:55:10,499 DEBUG js.JSParseFilter - - outlink from JS:
'http://www.wein-plus.de/glossar/<\/a>'
I have seen these act as "crawler traps" that make small sites explode into
many, many URLs. For the moment, I have plugged the worst offenders with
specific filter rules, but it would be nice to find a way to improve the
JSParseFilter's heuristics so that such URLs are not generated at all.
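
One possible heuristic, sketched below, is to discard any candidate outlink
that contains characters which cannot appear unescaped in a URL and which
usually mean an HTML fragment was mistaken for a link. This is only an
illustration: the class name JsOutlinkHeuristics, the method isPlausibleUrl,
and the chosen character set are all assumptions, not part of the existing
JSParseFilter code.

```java
/**
 * Hypothetical helper (not part of Nutch): rejects candidate outlinks
 * that look like leftover HTML markup rather than real URLs.
 */
public final class JsOutlinkHeuristics {

    // Assumed set of characters that should never appear unescaped in a
    // URL; '<', '>' and '\' cover every bogus outlink quoted in this report.
    private static final String SUSPICIOUS = "<>\\\"{}|^`";

    private JsOutlinkHeuristics() {
        // static utility class, no instances
    }

    /** Returns true if the candidate outlink should be kept. */
    public static boolean isPlausibleUrl(String url) {
        if (url == null || url.length() == 0) {
            return false;
        }
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            // Whitespace or a suspicious character means the string is
            // more likely a markup fragment than a link.
            if (Character.isWhitespace(c) || SUSPICIOUS.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

With such a check, 'http://www.metropoleparis.com/2000/501/</IFRAME>' and
'http://www.wein-plus.de/glossar/<\/a>' would both be dropped, while
'http://www.metropoleparis.com/2000/501/' would be kept. It would not catch
bogus outlinks made up only of legal URL characters.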