Re: Tika HTML parsing

Ken Krugler Sun, 15 Aug 2010 11:01:50 -0700

Hi Andrzej,

On Aug 15, 2010, at 12:04am, Andrzej Bialecki wrote:

On 2010-08-15 06:54, Ken Krugler wrote:

For what it's worth, I just committed some patches to Tika that should
improve Tika's ability to extract HTML outlinks (in <img> and <frame>
elements, at least). Support for <iframe> should be coming soon :)

This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm
tracking down, but I think Tika is getting closer to being usable by
Nutch for typical web crawling.


Thanks Ken for pushing forward this work! A few questions:

* does this include image maps as well (<area>)?

I've got a patch for that (the same one that does iframes). Hopefully I'll commit that today.
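
To make "outlinks" concrete, here is a rough sketch of the kind of extraction being discussed. It is not Tika's code: it uses Python's standard html.parser rather than TagSoup, and the LINK_ATTRS table, OutlinkCollector class and extract_outlinks function are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical table of element -> attribute that carries the link.
# This mirrors the elements mentioned in this thread; it is not taken
# from Tika's source.
LINK_ATTRS = {
    "a": "href",
    "area": "href",    # image maps
    "img": "src",
    "frame": "src",
    "iframe": "src",
}


class OutlinkCollector(HTMLParser):
    """Collects (element, url) pairs for the link-bearing elements above."""

    def __init__(self):
        super().__init__()
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # Skip attributes with no value or an empty value.
            if name == wanted and value:
                self.outlinks.append((tag, value))


def extract_outlinks(html):
    """Returns the outlinks found in an HTML string, in document order."""
    collector = OutlinkCollector()
    collector.feed(html)
    collector.close()
    return collector.outlinks
```

Calling extract_outlinks() on a page returns a list of (element, url) pairs in document order, so a crawler could still tell an <img> source apart from an <area> target.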

* how does the code treat invalid html with both body and frameset?


TagSoup should clean up the invalid HTML.

The issue you'd run into with <body><frameset> is that TagSoup maps it to an empty <body />, followed by <frameset>...</frameset>.

I committed a patch that fixes this, at least for the examples that I tried (including the one that Julien reported).

* what's the status of extracting the meta robots and link rel information?


All <meta> elements are now emitted in the resulting <head> element.

And <link> and <base> elements should be passed through.
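
As a rough illustration of what a consumer could do once those elements are available, the sketch below reads the robots directives, <link> rel/href pairs and the <base> href out of a page. It is only a sketch under stated assumptions: it parses raw HTML with Python's standard html.parser instead of consuming Tika's output, and HeadInfoCollector and read_head_info are hypothetical names, not part of Tika or Nutch.

```python
from html.parser import HTMLParser


class HeadInfoCollector(HTMLParser):
    """Collects robots directives, <link> rel/href pairs and the <base> href."""

    def __init__(self):
        super().__init__()
        self.robots = set()   # e.g. {"noindex", "nofollow"}
        self.links = []       # list of (rel, href) pairs
        self.base = None      # first <base href="..."> seen, if any

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (attribute with no value), hence "or".
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated and case-insensitive.
            for token in content.split(","):
                token = token.strip().lower()
                if token:
                    self.robots.add(token)
        elif tag == "link" and attr.get("href"):
            self.links.append((attr.get("rel") or "", attr["href"]))
        elif tag == "base" and attr.get("href") and self.base is None:
            self.base = attr["href"]


def read_head_info(html):
    """Parses an HTML string and returns the populated collector."""
    collector = HeadInfoCollector()
    collector.feed(html)
    collector.close()
    return collector
```

With this, a crawler could check "nofollow" in read_head_info(page).robots before queuing any outlinks, and resolve relative URLs against the collected base.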

It would be great to get input on just how "fixed" things are now, or maybe after the next patch gets committed.


Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
