Re: Parse-tika ignores too much data...

2010-07-08 Thread Ken Krugler
On Jul 8, 2010, at 12:15am, Andrzej Bialecki wrote: On 2010-07-07 22:32, Ken Krugler wrote: Hi Julien, See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way is handled, we also saw ca

Re: Parse-tika ignores too much data...

2010-07-08 Thread Andrzej Bialecki
On 2010-07-07 22:32, Ken Krugler wrote: Hi Julien, See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way is handled, we also saw cases where it was twice in the output. Don't know about th

Re: Parse-tika ignores too much data...

2010-07-07 Thread Julien Nioche
Hi Ken, Thank you for your comments and analysis. We should probably modify the HTMLHandler so that it does not discard a frameset because of the bodylevel being equal to 0. I suggested earlier on the Tika list having a mechanism for specifying a custom handler via the Context, that would give us

Re: Parse-tika ignores too much data...

2010-07-07 Thread Ken Krugler
Hi Julien, See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way is handled, we also saw cases where it was twice in the output. Don't know about the case of it appearing twice. But f

Re: Parse-tika ignores too much data...

2010-07-07 Thread Julien Nioche
Ken, See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way is handled, we also saw cases where it was twice in the output. J. On 7 July 2010 17:41, Ken Krugler wrote: > Hi Andrzej, > > I've

Re: Parse-tika ignores too much data...

2010-07-07 Thread Ken Krugler
Hi Andrzej, I've got an old list of cases where Tika was not extracting links: - frame - iframe - img - map - object - link (only in section) I worked around this in my crawling code, by directly processing the DOM, but I should roll this into Tika. If you have a list of problems with

Parse-tika ignores too much data...

2010-07-07 Thread Andrzej Bialecki
Hi, I'm going through NUTCH-840, and I tried to eat our own dog food, i.e. prepare the test DOMs with Tika's HtmlParser. Results are not so good for some test cases... Even when using IdentityHtmlMapper Tika ignores some elements (such as frame/frameset) and for some others (area) it drops