Benjamin Higgins wrote:
> I was taking a look at HtmlParser.java, and I think the fix to NUTCH-17 was
> accidentally removed. See:
>
> http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.8/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=log
>
> Specifically, in revision 160319, among other things, DOMFragmentParser was
> changed to DOMParser, because, in the comment to that revision:
>
>   Changed to use NekoHTML's DOMParser instead of its DOMFragmentParser.
>   For some reason, the DOMFragmentParser can be very slow with large
>   documents while the DOMParser has no problems with these. Also added
>   a main() that permits easier debugging.
>
> However, in 179436, a big patch that included TagSoup among other things,
> the change to DOMParser seems to have been lost.
>
> I bring this up because I am having the exact same problem as described in
> NUTCH-17. I am using Neko 0.9.4. It occurs on some particularly long
> documents. The fetcher simply hangs. If I wait a few hours it will resume
> again. The HTML is nothing special; in fact, it's just a bunch of text
> (html escaped ie < > & chars converted) inside a <pre> tag.
>
> Comments?

I cannot comment on the issue itself, but if you can submit a patch (perhaps
with a test case that demonstrates the problem), it will be easier to act on.

--
 Sami Siren

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
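
A test case like the one requested above needs an input file first. The following is only a rough sketch of how such a file might be generated: a long run of HTML-escaped text inside a single <pre> tag, matching the description in the report. The class name LargePreDocument, the sample line content, the default line count and the output file name are all made up for illustration. The sketch does not call NekoHTML or Nutch, and whether a file of this shape actually triggers the slowdown has not been verified here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch: builds a large HTML document whose
// body is one <pre> block of escaped text.
final class LargePreDocument {

    /** Escapes the three characters mentioned in the report: &, < and >. */
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Returns a document with lineCount escaped lines inside a <pre> tag. */
    static String build(int lineCount) {
        StringBuilder doc = new StringBuilder();
        doc.append("<html><body><pre>\n");
        for (int i = 0; i < lineCount; i++) {
            // Sample text chosen so that every line contains <, > and &.
            String line = "line " + i + ": if (a < b && b > c) { return; }";
            doc.append(escapeHtml(line)).append('\n');
        }
        doc.append("</pre></body></html>\n");
        return doc.toString();
    }

    public static void main(String[] args) throws IOException {
        // Both defaults are arbitrary; adjust them until the file is "long".
        int lineCount = args.length > 0 ? Integer.parseInt(args[0]) : 200_000;
        Path target = Paths.get(args.length > 1 ? args[1] : "large-pre.html");
        Files.write(target, build(lineCount).getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

The generated file could then be fed to the parser under test, for example through the main() mentioned in the revision 160319 comment if it is still present, while timing the parse for the DOMFragmentParser and DOMParser variants.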
