Re: Parse-tika ignores too much data...

Julien Nioche Wed, 07 Jul 2010 13:42:21 -0700

Hi Ken,

Thank you for your comments and analysis. We should probably modify the
HTMLHandler so that it does not discard a frameset just because the body
level is 0. I suggested earlier on the Tika list adding a mechanism for
specifying a custom handler via the Context; that would give us the option
in Nutch to implement the logic we want, i.e. ignore the body level if we
want to.
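
A minimal sketch of that kind of logic, using only the JDK's SAX classes
rather than Tika's actual handler. The class name LinkCollector and the
tag-to-attribute table are made up for illustration, and the JDK parser
needs well-formed markup, which real-world HTML often is not:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects link targets without tracking body level.
public class LinkCollector extends DefaultHandler {
    // Element name -> attribute carrying the link target (assumed mapping;
    // links inside a <map> are taken from its <area> children).
    private static final Map<String, String> LINK_ATTRS = Map.of(
        "a", "href", "area", "href", "link", "href",
        "frame", "src", "iframe", "src", "img", "src", "object", "data");

    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Deliberately no body-level check: <frame> lives outside <body>.
        String attr = LINK_ATTRS.get(qName.toLowerCase(Locale.ROOT));
        if (attr != null) {
            String value = atts.getValue(attr);
            if (value != null && !value.isEmpty()) {
                links.add(value);
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    // Parses a well-formed document and returns links in document order.
    public static List<String> collect(String xhtml) throws Exception {
        LinkCollector handler = new LinkCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><link href=\"style.css\"/></head>"
            + "<frameset><frame src=\"top.html\"/>"
            + "<frame src=\"main.html\"/></frameset></html>";
        // prints [style.css, top.html, main.html]
        System.out.println(collect(page));
    }
}
```

A handler along these lines, if it could be passed in via the Context,
would let Nutch decide which elements count as outlinks.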


Thanks

J.

On 7 July 2010 21:32, Ken Krugler <[email protected]> wrote:

> Hi Julien,
>
> See https://issues.apache.org/jira/browse/TIKA-457 for a description of
> one of the cases found by Andrzej. There seems to be something very wrong
> with the way <body> is handled; we also saw cases where it appeared twice
> in the output.
>
>
> Don't know about the case of it appearing twice.
>
> But for the above issue, I added a comment. The test HTML is badly broken,
> in that you can either have a <body> OR a <frameset>, but not both.
>
> -- Ken
>
> On 7 July 2010 17:41, Ken Krugler <[email protected]> wrote:
>
>> Hi Andrzej,
>>
>> I've got an old list of cases where Tika was not extracting links:
>>
>>  - frame
>>  - iframe
>>  - img
>>  - map
>>  - object
>>  - link (only in <head> section)
>>
>> I worked around this in my crawling code, by directly processing the DOM,
>> but I should roll this into Tika.
>>
>> If you have a list of problems with test docs, file a TIKA issue and I'll
>> try to fix things up quickly.
>>
>> Thanks,
>>
>> -- Ken
>>
>>
>> On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:
>>
>>> Hi,
>>>
>>> I'm going through NUTCH-840, and I tried to eat our own dog food, i.e.
>>> prepare the test DOMs with Tika's HtmlParser.
>>>
>>> Results are not so good for some test cases... Even when using
>>> IdentityHtmlMapper Tika ignores some elements (such as frame/frameset) and
>>> for some others (area) it drops the href. As a result, the number of valid
>>> outlinks collected with parse-tika is much smaller than with parse-html.
>>>
>>> I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and
>>> a partial fix was applied to Tika 0.8, but still this won't handle the
>>> problems I mentioned above.
>>>
>>> Can we come up with a plan to address this? I'd rather switch completely
>>> to Tika's HTML parsing, but at the moment we would lose too much useful
>>> data...
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>> ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
