Benjamin Higgins wrote:
I was taking a look at HtmlParser.java, and I think the fix to NUTCH-17 was
accidentally removed. See:
http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.8/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=log
Specifically, in revision 160319, among other things, DOMFragmentParser was
changed to DOMParser. The commit comment for that revision explains why:

    Changed to use NekoHTML's DOMParser instead of its DOMFragmentParser.
    For some reason, the DOMFragmentParser can be very slow with large
    documents while the DOMParser has no problems with these. Also added
    a main() that permits easier debugging.

However, in revision 179436, a big patch that included TagSoup among other
things, the change to DOMParser seems to have been lost.
I bring this up because I am having the exact same problem as described in
NUTCH-17. I am using Neko 0.9.4. It occurs on some particularly long
documents. The fetcher simply hangs. If I wait a few hours it will resume
again. The HTML is nothing special; in fact, it's just a bunch of text
(HTML-escaped, i.e. with the <, > and & characters converted to entities)
inside a <pre> tag.
Comments?
I cannot comment on the issue itself, but if you can submit a patch
(perhaps with testcase that demonstrates this) then it will be easier to
act on.
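
As a starting point for such a testcase, here is a minimal sketch of a
generator for the kind of input described above: one long <pre> block of
escaped text. The class name, the sample line and the line count are all
made up for illustration, and the parser passed to timeMillis() is only a
placeholder; a real test would plug in the NekoHTML DOMFragmentParser and
DOMParser calls from HtmlParser.java and compare the two timings.

```java
import java.util.function.Consumer;

// Hypothetical helper, not part of Nutch: builds a large HTML page whose
// body is a single <pre> block of escaped text, and times a parser on it.
class LargePreDocument {

    // Escape the three characters mentioned in the report. The ampersand
    // must be replaced first so the other entities are not double-escaped.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build the page with the given number of text lines inside <pre>.
    static String build(int lines) {
        StringBuilder sb = new StringBuilder("<html><body><pre>\n");
        for (int i = 0; i < lines; i++) {
            sb.append(escape("line " + i + ": if (a < b && b > c) swap(a, c);"));
            sb.append('\n');
        }
        return sb.append("</pre></body></html>\n").toString();
    }

    // Run the supplied parser callback on the page and return the elapsed
    // wall-clock time in milliseconds.
    static long timeMillis(Consumer<String> parser, String html) {
        long start = System.nanoTime();
        parser.accept(html);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        String html = build(200_000); // assumed size; tune until the slowdown shows
        // Placeholder parser that only reads the length. Replace the lambda
        // body with the actual parser invocation under test.
        long ms = timeMillis(s -> s.length(), html);
        System.out.println(html.length() + " chars parsed in " + ms + " ms");
    }
}
```

Running the same generated page through both parsers with a timeout would
show whether the slowdown from NUTCH-17 is back.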
--
Sami Siren