Sami Siren wrote:
> Benjamin Higgins wrote:
>> Comments?
>
> I cannot comment on the issue itself, but if you can submit a patch
> (perhaps with testcase that demonstrates this) then it will be easier
> to act on.

Benjamin,

Could you please send me a copy of the offending HTML for testing (off the list)?

A little background: I knew of this issue when I changed the API to use DocumentFragment. However, as far as I was able to test with the most recent version of Neko at that time, it didn't exhibit this problem.

The main motivation for the change was to enable better parsing of broken documents with multiple <html> tags (or no <html> at all, but <head> and <body> as "root" elements). This is not possible with a Document, but it is possible with a DocumentFragment, which doesn't have to represent a well-formed XML tree and, specifically, doesn't require a single root node (please see the Javadoc of org.w3c.dom.DocumentFragment for a longer explanation).

So, if we change it back to Document we will lose this functionality, and some pages will be severely truncated, because in such cases NekoHTML takes only the first "pseudo-root" node and discards all others. However, if you are dealing mostly with well-formed documents you may not need this ...

-- 
Best regards,
Andrzej Bialecki     <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
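
To illustrate the single-root restriction described above, here is a minimal sketch using only the JDK's built-in org.w3c.dom API. It does not involve NekoHTML or the Nutch parser; the class name FragmentDemo and its method names are made up for this example. It shows that a DocumentFragment accepts several top-level elements, while a Document refuses a second root element.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;

// Hypothetical demo class; not part of Nutch or NekoHTML.
public class FragmentDemo {

    // Create an empty DOM Document with the JDK's default parser factory.
    static Document newDocument() throws ParserConfigurationException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    // Append two sibling "root" elements to a DocumentFragment and
    // return how many top-level children it ends up holding.
    static int fragmentRootCount() throws ParserConfigurationException {
        Document doc = newDocument();
        DocumentFragment frag = doc.createDocumentFragment();
        frag.appendChild(doc.createElement("head"));
        frag.appendChild(doc.createElement("body"));
        return frag.getChildNodes().getLength();
    }

    // Try to give a Document a second root element.
    // Returns true if the DOM implementation rejects it with
    // HIERARCHY_REQUEST_ERR, false if the append succeeds.
    static boolean documentRejectsSecondRoot() throws ParserConfigurationException {
        Document doc = newDocument();
        doc.appendChild(doc.createElement("head"));
        try {
            doc.appendChild(doc.createElement("body"));
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.HIERARCHY_REQUEST_ERR;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("fragment top-level nodes: " + fragmentRootCount());
        System.out.println("document rejects second root: " + documentRejectsSecondRoot());
    }
}
```

With the DOM implementation bundled in the JDK, I would expect the fragment to hold both elements and the Document to reject the second one. How NekoHTML itself fills the fragment for a given broken page is a separate question that this sketch does not cover.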
