On 2010-08-15 20:01, Ken Krugler wrote:

* does this include image maps as well (<area>)?

I've got a patch for that (the same one that does iframes). Hopefully
I'll commit that today.

Cool.


* how does the code treat invalid html with both body and frameset?

TagSoup should clean up the invalid HTML.

The issue you'd run into with <body><frameset> is that TagSoup maps it
to an empty <body />, followed by <frameset>...</frameset>.

I committed a patch that fixes this, at least for the examples that I
tried (including the one that Julien reported).

Great, that was one example of invalid HTML from our parse-html tests.


* what's the status of extracting the meta robots and link rel
information?

All <meta> elements are now emitted in the resulting <head> element.

And <link> and <base> elements should be passed through.

Sounds great.
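
For anyone who wants to cross-check what a parser emits for these elements, here is a rough, standalone sketch of the kind of extraction discussed above (meta robots, link rel, base, and area). It uses only Python's standard html.parser and is not Tika or TagSoup code; the class name HeadLinkCollector and the helper collect() are made up for illustration.

```python
from html.parser import HTMLParser


class HeadLinkCollector(HTMLParser):
    """Hypothetical helper: gathers robots directives and link-like targets."""

    def __init__(self):
        super().__init__()
        self.robots = []   # lower-cased directives from <meta name="robots">
        self.links = []    # (rel, href) pairs from <link>
        self.base = None   # href of the first <base> element seen
        self.areas = []    # href values from image-map <area> elements

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names; values stay as-is
        # and may be None for attributes written without a value.
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            content = a.get("content") or ""
            self.robots += [d.strip().lower() for d in content.split(",") if d.strip()]
        elif tag == "link" and a.get("href"):
            self.links.append((a.get("rel"), a["href"]))
        elif tag == "base" and a.get("href") and self.base is None:
            self.base = a["href"]
        elif tag == "area" and a.get("href"):
            self.areas.append(a["href"])


def collect(html):
    """Parse an HTML string and return the populated collector."""
    collector = HeadLinkCollector()
    collector.feed(html)
    collector.close()
    return collector
```

Feeding a test page through collect() and comparing robots, links, base and areas against the elements found in the parser's output would be one simple way to spot anything that gets dropped.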


It would be great to get input on just how "fixed" things are now, or
maybe after the next patch gets committed.

We have a set of torture tests that we subjected parse-html to... ;) We'll
see how Tika fares now. Overall this sounds like great progress!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
