[Nutch-dev] problem with cyberneko html parser?

john Tue, 29 Jun 2004 08:51:39 -0700

I noticed that, when crawling the same HTML files repeatedly,
fetcher.java does not always extract identical outlinks.
Looking further, the cyberneko HTML parser appears to "randomly"
scan attributes in <a>..</a> elements incorrectly.
For example, I have seen <a href=blah> interpreted as <a name=blah>,
or vice versa, but not every time. However, the problem goes away if
fetcher.java is run with a single thread.
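
Since the problem disappears with a single thread, one possible explanation
is that a parser instance, or some state inside it, is shared between fetcher
threads. This is an assumption and has not been checked against the Nutch or
cyberneko sources. A minimal sketch of the usual workaround, one parser per
thread, follows. It uses the JDK's DocumentBuilder (documented as not
thread-safe) purely as a stand-in for the cyberneko parser, and the class and
method names (PerThreadParserSketch, extractHrefs) are hypothetical, not
Nutch code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PerThreadParserSketch {
    // Each thread lazily gets its own parser, so no parser state is shared.
    // DocumentBuilder is only a stand-in for the real HTML parser (assumption).
    private static final ThreadLocal<DocumentBuilder> PARSER =
        ThreadLocal.withInitial(() -> {
            try {
                return DocumentBuilderFactory.newInstance().newDocumentBuilder();
            } catch (ParserConfigurationException e) {
                throw new IllegalStateException(e);
            }
        });

    // Returns the href values of all <a> elements in a well-formed document.
    static List<String> extractHrefs(String xhtml) throws Exception {
        DocumentBuilder builder = PARSER.get();
        builder.reset(); // clear any state left over from a previous parse
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                hrefs.add(a.getAttribute("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"blah\">link</a>"
                    + "<a name=\"top\">anchor</a></body></html>";
        List<String> expected = extractHrefs(page);

        // Parse the same page many times from several threads and count
        // how often the extracted links differ from the single-thread result.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<List<String>>> results = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            results.add(pool.submit(() -> extractHrefs(page)));
        }
        int mismatches = 0;
        for (Future<List<String>> f : results) {
            if (!f.get().equals(expected)) {
                mismatches++;
            }
        }
        pool.shutdown();
        System.out.println("mismatches: " + mismatches);
    }
}
```

If the real cause is shared parser state, the same pattern applied to the
actual parser should make repeated multi-threaded runs agree with the
single-threaded result. If the mismatches persist, the cause lies elsewhere.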


Given its nature, I would expect the cyberneko HTML parser to interpret
attributes differently when the HTML is non-standard.
But my tests were done with good HTML.

Does anyone else experience the same problem?

John

__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
