Ken,

See https://issues.apache.org/jira/browse/TIKA-457 for a description of one
of the cases found by Andrzej. There seems to be something very wrong with
the way <body> is handled: we also saw cases where it appeared twice in the
output.
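
For reference, the kind of DOM workaround Ken describes below could look
roughly like the following. This is only a minimal sketch, not Ken's code and
not anything in Tika or Nutch: the class name DomLinkExtractor, the method
extractLinks and the element-to-attribute table are all made up for
illustration, and it assumes the input is already well-formed XHTML so that
the plain JDK XML parser can build the DOM. Note that <map> carries no link
itself; its links live on the nested <area> elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects link targets by walking the DOM directly.
final class DomLinkExtractor {

    // Assumed mapping from element name to the attribute that holds its link.
    private static final Map<String, String> LINK_ATTRS = Map.of(
            "a", "href",
            "area", "href",
            "link", "href",
            "frame", "src",
            "iframe", "src",
            "img", "src",
            "object", "data");

    // Parses well-formed XHTML and returns link targets in document order.
    static List<String> extractLinks(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            List<String> links = new ArrayList<>();
            collect(doc.getDocumentElement(), links);
            return links;
        } catch (Exception e) {
            // Real-world HTML is rarely well-formed; this sketch just gives up.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Depth-first walk: record this element's link, then visit its children.
    private static void collect(Element element, List<String> links) {
        String attr = LINK_ATTRS.get(element.getTagName().toLowerCase(Locale.ROOT));
        if (attr != null) {
            String value = element.getAttribute(attr);
            if (!value.isEmpty()) {
                links.add(value);
            }
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                collect((Element) child, links);
            }
        }
    }
}
```

Calling DomLinkExtractor.extractLinks(...) on a page returns the raw attribute
values; resolving them against the base URL and filtering would still be up to
the caller.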

J.

On 7 July 2010 17:41, Ken Krugler wrote:

> Hi Andrzej,
>
> I've got an old list of cases where Tika was not extracting links:
>
>  - frame
>  - iframe
>  - img
>  - map
>  - object
>  - link (only in <head> section)
>
> I worked around this in my crawling code, by directly processing the DOM,
> but I should roll this into Tika.
>
> If you have a list of problems with test docs, file a TIKA issue and I'll
> try to fix things up quickly.
>
> Thanks,
>
> -- Ken
>
>
> On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:
>
>> Hi,
>>
>> I'm going through NUTCH-840, and I tried to eat our own dog food, i.e.
>> prepare the test DOMs with Tika's HtmlParser.
>>
>> Results are not so good for some test cases... Even when using
>> IdentityHtmlMapper, Tika ignores some elements (such as frame/frameset),
>> and for some others (area) it drops the href. As a result, the number of
>> valid outlinks collected with parse-tika is much smaller than with
>> parse-html.
>>
>> I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and
>> a partial fix was applied to Tika 0.8, but it still doesn't handle the
>> problems I mentioned above.
>>
>> Can we come up with a plan to address this? I'd rather switch completely
>> to Tika's HTML parsing, but at the moment we would lose too much useful
>> data...
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>> ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
