Sorry, guys. I'm +1 for Ken's proposal, and for potentially including examples of your Bixo tests in the Tika codebase :)

Ken, can you attach some of your tests to a new JIRA issue for this and link it to TIKA-379?

Cheers,
Chris

On 8/11/10 7:19 PM, "Ken Krugler" <[email protected]> wrote:

Hi all,

Digging deeper, the current behavior seems to be causing problems that were not evident in Tika 0.7. We noticed this when switching the Bixo code to use Tika 0.8-SNAPSHOT.

For example, if you have a document that looks like:

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Some Title</title>
</head>
<body>
...
</html>

The lazyStartDocument() method is called when the <meta> tag is encountered by HtmlHandler, because it calls xhtml.startElement() with the meta tag. Since this is before <title> has been seen, the output generated has an empty <title> element. And that causes a bunch of problems for our tests.

I believe this (and the previous problem I'd reported) is a side-effect of TIKA-379, which Chris M. rolled in during change 949635.

Unfortunately I think lazyStartDocument() needs to be re-thought. A rough proposal would be:

1. HtmlHandler should call xhtml start/endElement for all elements, versus creating a fragile implicit dependency between its behavior and that of XHTMLContentHandler.

2. In XHTMLContentHandler, the elements received should be queued up until endElement() is called for <head>, or startElement() is called for <body>, or endDocument() is called.

-- Ken

On Aug 10, 2010, at 7:53pm, Ken Krugler wrote:

> Hi all,
>
> I was trying to debug why my fix for a problem with the Boilerpipe
> integration wasn't working, and came across
> XHTMLContentHandler.lazyStartDocument().
>
> This, when used by HtmlHandler, essentially skips calling the
> user-provided content handler for the initial element tags (html, head,
> body) until it looks like there's a reason to generate content. Then
> it calls the content handler with no-attribute versions of these
> elements, so attributes in elements like <html lang="en"> will get
> stripped.
> Which seems like not a great thing, especially given
> ongoing work to make it easier to pass everything through if that's
> what's needed.
>
> But the problem I ran into was with this sequence:
>
> <html>
> <head>
> <title>xxx</title>
> <meta blah>
> </head>
> <body>
> ...
> </body>
> </html>
>
> The problem is that this call to lazyStartDocument() is made when the
> <meta> element is encountered. So what the content handler gets
> called with is:
>
> <html>
> <head>
> <title>xxx</title>
> </head>
> <body>
>
> and then <meta>
>
> So the <meta> element is getting passed through after the <body>
> element. And that in turn prevents Boilerpipe from behaving as
> expected.
>
> But before I dive in here and start filing issues/hacking on the
> code, I'm wondering if somebody (OK, Jukka) can provide some color
> commentary.
>
> Thanks,
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
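
For reference, step 2 of Ken's rough proposal (queue elements until endElement() for <head>, startElement() for <body>, or endDocument()) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the actual XHTMLContentHandler code: the class name QueueingHandlerSketch and its private Event interface are hypothetical, it relies only on the standard org.xml.sax API, it assumes namespace-aware events (so the local name is populated), and it uses Java 8 lambdas for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of the queueing idea; not Tika's XHTMLContentHandler. */
public class QueueingHandlerSketch extends DefaultHandler {

    /** A deferred SAX event that can be replayed against the downstream handler. */
    private interface Event {
        void replay(ContentHandler out) throws SAXException;
    }

    private final ContentHandler downstream;
    private final List<Event> queue = new ArrayList<>();
    private boolean flushed = false;

    public QueueingHandlerSketch(ContentHandler downstream) {
        this.downstream = downstream;
    }

    /** Replays queued events in arrival order; later events pass straight through. */
    private void flush() throws SAXException {
        if (flushed) {
            return;
        }
        flushed = true;
        for (Event e : queue) {
            e.replay(downstream);
        }
        queue.clear();
    }

    @Override
    public void startDocument() throws SAXException {
        downstream.startDocument();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if ("body".equals(local)) {
            flush(); // trigger: startElement() for <body>
        }
        if (flushed) {
            downstream.startElement(uri, local, qName, atts);
        } else {
            // Copy the attributes: SAX parsers may reuse the Attributes object.
            Attributes copy = new AttributesImpl(atts);
            queue.add(out -> out.startElement(uri, local, qName, copy));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (flushed) {
            downstream.endElement(uri, local, qName);
            return;
        }
        queue.add(out -> out.endElement(uri, local, qName));
        if ("head".equals(local)) {
            flush(); // trigger: endElement() for <head>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (flushed) {
            downstream.characters(ch, start, length);
        } else {
            String text = new String(ch, start, length);
            queue.add(out -> out.characters(text.toCharArray(), 0, text.length()));
        }
    }

    @Override
    public void endDocument() throws SAXException {
        flush(); // trigger: endDocument()
        downstream.endDocument();
    }
}
```

The sketch only defers and replays events in their original order, attributes included. A real fix would presumably also inspect the queued head content before flushing (for example, to decide whether a <title> still needs to be generated), which is left out here.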
