Sami Siren wrote:
> Benjamin Higgins wrote:
>> Comments?
>
> I cannot comment on the issue itself, but if you can submit a patch
> (perhaps with a test case that demonstrates this) then it will be easier
> to act on.

Benjamin,

Could you please send me a copy of the offending HTML for testing (off 
the list)?

A little background: I knew of this issue when I changed the API to use 
DocumentFragment. However, as far as I was able to test it with the most 
recent version of Neko at that time, it didn't exhibit this problem.

The main motivation for switching to DocumentFragment was to enable
better parsing of broken documents with multiple <html> tags (or no
<html> at all, but <head> and <body> as "root" elements). This is not
possible with a Document, but it is possible with a DocumentFragment,
which doesn't have to represent a well-formed XML tree and,
specifically, doesn't require a single root node (please see the
Javadoc of org.w3c.dom.DocumentFragment for a longer explanation).
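
To illustrate the single-root difference, here is a minimal sketch that
uses only the DOM classes bundled with the JDK, not NekoHTML or the Nutch
parser itself. The class and method names (FragmentRootsDemo and so on)
are made up for this example. It tries to append a second top-level
element to a Document, which the DOM implementation is expected to
reject, and then appends <head> and <body> side by side to a
DocumentFragment:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;

// Hypothetical demo class; not part of Nutch or NekoHTML.
class FragmentRootsDemo {

    // Creates an empty DOM Document using the JDK's default parser factory.
    static Document newDocument() throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
    }

    // Returns true if a Document accepts a second top-level element.
    static boolean documentAcceptsSecondRoot() throws Exception {
        Document doc = newDocument();
        doc.appendChild(doc.createElement("html"));
        try {
            // A Document may have only one document element, so this
            // append is expected to fail with a DOMException.
            doc.appendChild(doc.createElement("html"));
            return true;
        } catch (DOMException e) {
            return false;
        }
    }

    // Returns how many top-level nodes a fragment holds after appending
    // <head> and <body> directly, with no enclosing <html>.
    static int fragmentTopLevelCount() throws Exception {
        Document doc = newDocument();
        DocumentFragment frag = doc.createDocumentFragment();
        frag.appendChild(doc.createElement("head"));
        frag.appendChild(doc.createElement("body"));
        return frag.getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Document accepts second root: "
                + documentAcceptsSecondRoot());
        System.out.println("Fragment top-level nodes: "
                + fragmentTopLevelCount());
    }
}
```

This only shows the DOM-level constraint; how NekoHTML fills either
container from real HTML is a separate matter.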

So, if we change it back to Document we will lose this functionality, 
and some pages will be severely truncated, because in such cases 
NekoHTML takes only the first "pseudo-root" node and discards all 
others. However, if you are dealing mostly with well-formed documents 
you may not need this ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
