Ken,

See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way <body> is handled; we also saw cases where it appeared twice in the output.
J.

On 7 July 2010 17:41, Ken Krugler <[email protected]> wrote:

> Hi Andrzej,
>
> I've got an old list of cases where Tika was not extracting links:
>
> - frame
> - iframe
> - img
> - map
> - object
> - link (only in <head> section)
>
> I worked around this in my crawling code, by directly processing the DOM,
> but I should roll this into Tika.
>
> If you have a list of problems with test docs, file a TIKA issue and I'll
> try to fix things up quickly.
>
> Thanks,
>
> -- Ken
>
>
> On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:
>
>> Hi,
>>
>> I'm going through NUTCH-840, and I tried to eat our own dog food, i.e.
>> prepare the test DOM-s with Tika's HtmlParser.
>>
>> Results are not so good for some test cases... Even when using
>> IdentityHtmlMapper Tika ignores some elements (such as frame/frameset) and
>> for some others (area) it drops the href. As a result, the number of valid
>> outlinks collected with parse-tika is much smaller than with parse-html.
>>
>> I know this issue has been reported (TIKA-379, NUTCH-817, NUTCH-794), and
>> a partial fix was applied to Tika 0.8, but still this won't handle the
>> problems I mentioned above.
>>
>> Can we come up with a plan to address this? I'd rather switch completely
>> to Tika-s HTML parsing, but at the moment we would lose too much useful
>> data...
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>> ___. ___ ___ ___ _ _ __________________________________
>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
>> ___|||__|| \| || | Embedded Unix, System Integration
>> http://www.sigram.com Contact: info at sigram dot com
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
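
For reference, the element list in Ken's message can be read as a table of "element, attribute that carries the URL". The sketch below illustrates that idea in Python with only the standard library's html.parser. It is not Tika's API and not Ken's actual workaround; the names LINK_ATTRS, OutlinkCollector and collect_outlinks are hypothetical. It assumes one URL attribute per element, and it covers map through its area children, since that is where the href lives.

```python
from html.parser import HTMLParser

# Hypothetical table: element name -> attribute assumed to carry the URL.
# "map" itself has no URL; its <area> children do.
LINK_ATTRS = {
    "a": "href",
    "area": "href",
    "frame": "src",
    "iframe": "src",
    "img": "src",
    "link": "href",
    "object": "data",
}


class OutlinkCollector(HTMLParser):
    """Collects (element, url) pairs for the elements in LINK_ATTRS."""

    def __init__(self):
        super().__init__()
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # Skip attributes that are present but empty or valueless.
            if name == wanted and value:
                self.outlinks.append((tag, value))


def collect_outlinks(html):
    """Return the outlinks found in an HTML string, in document order."""
    parser = OutlinkCollector()
    parser.feed(html)
    parser.close()
    return parser.outlinks
```

Calling collect_outlinks on a page that uses frame, area, or object returns those URLs alongside the usual a links, which is the set the thread reports as missing from parse-tika output. Resolving relative URLs against a base is left out of this sketch.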

