Benjamin Higgins wrote:
> I was taking a look at HtmlParser.java, and I think the fix to NUTCH-17 was
> accidentally removed.  See:
> 
> http://svn.apache.org/viewvc/lucene/nutch/tags/release-0.8/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=log
>
> Specifically, in revision 160319, among other things, DOMFragmentParser was
> changed to DOMParser, because, in the comment to that revision:
> 
> Changed to use NekoHTML's DOMParser instead of its DOMFragmentParser.
> For some reason, the DOMFragmentParser can be very slow with large
> documents while the DOMParser has no problems with these.  Also added
> a main() that permits easier debugging.
>
> However, in revision 179436, a big patch that included TagSoup among
> other things, the change to DOMParser seems to have been lost.
> 
> I bring this up because I am having the exact same problem as described in
> NUTCH-17.  I am using Neko 0.9.4.  It occurs on some particularly long
> documents.  The fetcher simply hangs.  If I wait a few hours, it resumes.
> The HTML is nothing special; in fact, it's just a bunch of text
> (HTML-escaped, i.e. the <, > and & characters converted to entities)
> inside a <pre> tag.
> 
> Comments?

I cannot comment on the issue itself, but if you can submit a patch
(perhaps with a test case that demonstrates the problem), it will be
easier to act on.
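
As a starting point for such a test case, the sketch below generates a
document of the kind described in the report: a large amount of
HTML-escaped text inside a single <pre> tag. It is only an illustration
under assumptions: the class name LargePreDocument, the sample line, and
the default size are hypothetical and not taken from the Nutch source, and
it has not been confirmed that input of this shape triggers the slowdown.

```java
public class LargePreDocument {

    /** Escapes the three characters mentioned in the report: &amp;, &lt; and &gt;. */
    static String escape(String text) {
        // '&' must be replaced first so the entities added below are not re-escaped.
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Builds an HTML document whose body is one <pre> block with the given number of lines. */
    static String build(int lines) {
        String escaped = escape("if (a < b && b > c) { return a & c; }");
        StringBuilder sb = new StringBuilder("<html><body><pre>\n");
        for (int i = 0; i < lines; i++) {
            sb.append(i).append(": ").append(escaped).append('\n');
        }
        return sb.append("</pre></body></html>\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical default size; adjust until the parser under test shows the problem.
        int lines = args.length > 0 ? Integer.parseInt(args[0]) : 100000;
        String html = build(lines);
        System.out.println("Generated " + html.length() + " characters");
    }
}
```

The generated string could then be passed to the parser under test and
timed, once with DOMFragmentParser and once with DOMParser, to see whether
the difference reported in NUTCH-17 shows up.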

--
  Sami Siren


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers