Re: Parse-tika ignores too much data...
On 2010-07-07 22:32, Ken Krugler wrote:
> Hi Julien,
>
>> See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way body is handled; we also saw cases where it appeared twice in the output.
>
> Don't know about the case of it appearing twice. But for the above issue, I added a comment. The test HTML is badly broken, in that you can have either a body OR a frameset, but not both.

The HTML was broken on purpose. One of the goals of the original test was to extract as much content and as many links as possible in the presence of grave errors. As you know, even major sites often produce badly broken HTML, but parsers sanitize it and produce a valid DOM. In this case, the parser produced two nested body elements, which is not valid.

I should also mention that NekoHTML handled this test much better, by removing the body and retaining only the frameset.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
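The two structural problems discussed above (nested body elements, and a body alongside a frameset) can be detected mechanically. The following is a minimal sketch using Python's standard `html.parser`, not Tika or NekoHTML; the class name `StructureChecker`, the function `structure_problems`, and the problem strings are all hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class StructureChecker(HTMLParser):
    """Records how deeply <body> elements nest and whether <frameset> occurs."""

    def __init__(self):
        super().__init__()
        self.body_depth = 0        # currently open <body> elements
        self.max_body_depth = 0    # deepest nesting seen so far
        self.saw_body = False
        self.saw_frameset = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "body":
            self.saw_body = True
            self.body_depth += 1
            self.max_body_depth = max(self.max_body_depth, self.body_depth)
        elif tag == "frameset":
            self.saw_frameset = True

    def handle_endtag(self, tag):
        if tag == "body" and self.body_depth > 0:
            self.body_depth -= 1


def structure_problems(markup):
    """Return a list of structural problems found in the markup (hypothetical helper)."""
    checker = StructureChecker()
    checker.feed(markup)
    checker.close()
    problems = []
    if checker.max_body_depth > 1:
        problems.append("nested body elements")
    if checker.saw_body and checker.saw_frameset:
        problems.append("both body and frameset present")
    return problems
```

A check like this could be run either on the deliberately broken test input or on a parser's serialized output, to confirm whether the sanitized result is still invalid.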
Re: Parse-tika ignores too much data...
Hi Andrzej,

I've got an old list of cases where Tika was not extracting links:

- frame
- iframe
- img
- map
- object
- link (only in head section)

I worked around this in my crawling code by directly processing the DOM, but I should roll this into Tika.

If you have a list of problems with test docs, file a TIKA issue and I'll try to fix things up quickly.

Thanks,

-- Ken

On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:
> Hi,
>
> I'm going through NUTCH-840, and I tried to eat our own dog food, i.e. prepare the test DOMs with Tika's HtmlParser. Results are not so good for some test cases...
>
> Even when using IdentityHtmlMapper, Tika ignores some elements (such as frame/frameset), and for some others (area) it drops the href. As a result, the number of valid outlinks collected with parse-tika is much smaller than with parse-html.
>
> I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and a partial fix was applied to Tika 0.8, but this still won't handle the problems I mentioned above.
>
> Can we come up with a plan to address this? I'd rather switch completely to Tika's HTML parsing, but at the moment we would lose too much useful data...
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
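The workaround Ken describes (collecting links from the elements a parser skips) can be sketched as a lookup from element name to the attribute that carries the URL. This is an illustration only, written against Python's standard `html.parser` rather than Tika or Nutch; `LINK_ATTRS`, `OutlinkCollector`, and `collect_outlinks` are hypothetical names, and the element-to-attribute mapping reflects standard HTML usage rather than anything stated in the thread. A `map` element has no URL of its own, so the sketch reads the `href` of its `area` children instead.

```python
from html.parser import HTMLParser

# Assumed mapping: which attribute holds the URL for each element of interest.
LINK_ATTRS = {
    "frame": "src",
    "iframe": "src",
    "img": "src",
    "area": "href",    # children of <map>
    "object": "data",
    "link": "href",
}


class OutlinkCollector(HTMLParser):
    """Collects (element, url) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from HTMLParser.
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append((tag, value))


def collect_outlinks(markup):
    """Return every link found on the listed elements (hypothetical helper)."""
    collector = OutlinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links
```

Because the collector works on the tag stream and not on a sanitized tree, it does not care whether a frameset and a body appear in the same document.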
Re: Parse-tika ignores too much data...
Hi Julien,

> See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way body is handled; we also saw cases where it appeared twice in the output.

Don't know about the case of it appearing twice. But for the above issue, I added a comment. The test HTML is badly broken, in that you can have either a body OR a frameset, but not both.

-- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Re: Parse-tika ignores too much data...
Hi Ken,

Thank you for your comments and analysis. We should probably modify the HTMLHandler so that it does not discard a frameset because of the bodylevel being equal to 0.

Earlier on the Tika list I suggested a mechanism for specifying a custom handler via the Context. That would give us the option in Nutch to implement the logic we want, i.e. ignore the body level if we want to.

Thanks

J.

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
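The body-level behaviour Julien refers to can be modelled in a few lines. The sketch below is not Tika's handler code; it is a simplified stand-in built on Python's standard `html.parser`, assuming only what the message states: elements seen while the body level is 0 are discarded, which is why a frameset outside the body gets lost. The names `BodyLevelFilter`, `keep_frames`, and `emitted_tags` are hypothetical.

```python
from html.parser import HTMLParser

FRAME_TAGS = {"frameset", "frame"}


class BodyLevelFilter(HTMLParser):
    """Passes start tags through only while inside <body>.

    With keep_frames=True, frameset/frame are passed through even
    when the body level is 0 (the change suggested in the message).
    """

    def __init__(self, keep_frames=False):
        super().__init__()
        self.keep_frames = keep_frames
        self.body_level = 0
        self.emitted = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.body_level += 1
            return
        inside_body = self.body_level > 0
        frame_exempt = self.keep_frames and tag in FRAME_TAGS
        if inside_body or frame_exempt:
            self.emitted.append(tag)

    def handle_endtag(self, tag):
        if tag == "body" and self.body_level > 0:
            self.body_level -= 1


def emitted_tags(markup, keep_frames=False):
    """Return the start tags that survive the filter (hypothetical helper)."""
    flt = BodyLevelFilter(keep_frames=keep_frames)
    flt.feed(markup)
    flt.close()
    return flt.emitted
```

Making the exemption a constructor option mirrors the idea of letting the caller (Nutch, in the message) choose the policy instead of hard-coding it in the parser.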