Re: Tika HTML parsing

Andrzej Bialecki Sun, 15 Aug 2010 00:04:54 -0700

On 2010-08-15 06:54, Ken Krugler wrote:

> For what it's worth, I just committed some patches to Tika that should
> improve Tika's ability to extract HTML outlinks (in <img> and <frame>
> elements, at least). Support for <iframe> should be coming soon :)
>
> This is in 0.8-SNAPSHOT, and there's one troubling parse issue I'm
> tracking down, but I think Tika is getting closer to being usable by
> Nutch for typical web crawling.


Thanks, Ken, for pushing this work forward! A few questions:

* Does this include image maps as well (<area>)?

* How does the code treat invalid HTML that contains both <body> and <frameset>?

* What's the status of extracting the meta robots and link rel information?

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
