Hi all,

Digging deeper, the current behavior seems to be causing problems that were not evident in Tika 0.7. We noticed this when switching the Bixo code to use Tika 0.8-SNAPSHOT.

For example, if you have a document that looks like:

<html>
        <head>
                <meta http-equiv="content-type" content="text/html; charset=utf-8">
                <title>Some Title</title>
        </head>
        <body>
        ...
</html>

The lazyStartDocument() method is called when the <meta> tag is encountered by HtmlHandler, because it calls xhtml.startElement() with the meta tag.

Since this is before <title> has been seen, the output generated has an empty <title> element. And that causes a bunch of problems for our tests.

I believe this (and the previous problem I'd reported) is a side effect of TIKA-379, which Chris M. rolled in during change 949635.

Unfortunately I think lazyStartDocument() needs to be re-thought. A rough proposal would be:

1. HtmlHandler should call xhtml start/endElement for all elements, rather than creating a fragile implicit dependency between its behavior and that of XHTMLContentHandler.

2. In XHTMLContentHandler, the elements received should be queued up until endElement() is called for <head>, or startElement() is called for <body>, or endDocument() is called.
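To make item 2 concrete, here is a minimal sketch of the queuing idea, using only the standard org.xml.sax classes. It is not the actual XHTMLContentHandler code; the class name HeadBufferingHandler and its Event interface are hypothetical, made up for illustration. It holds back every event until </head>, <body>, or endDocument() arrives, then replays the queue in order to the wrapped handler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: defers events until the head section is known to be complete. */
class HeadBufferingHandler extends DefaultHandler {

    /** A deferred SAX call that can be replayed against the downstream handler. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final ContentHandler downstream;
    private final List<Event> queue = new ArrayList<>();
    private boolean buffering = true;

    HeadBufferingHandler(ContentHandler downstream) {
        this.downstream = downstream;
    }

    /** Replays all queued events in their original order, then stops buffering. */
    private void flush() throws SAXException {
        if (!buffering) {
            return;
        }
        buffering = false;
        for (Event e : queue) {
            e.replay(downstream);
        }
        queue.clear();
    }

    @Override
    public void startDocument() throws SAXException {
        downstream.startDocument();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if ("body".equals(local)) {
            flush(); // <body> means the head is finished, even without </head>
        }
        if (buffering) {
            // Copy the attributes: parsers may reuse the Attributes instance.
            Attributes copy = new AttributesImpl(atts);
            queue.add(t -> t.startElement(uri, local, qName, copy));
        } else {
            downstream.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (buffering) {
            queue.add(t -> t.endElement(uri, local, qName));
            if ("head".equals(local)) {
                flush(); // the whole head has now been seen
            }
        } else {
            downstream.endElement(uri, local, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (buffering) {
            char[] copy = Arrays.copyOfRange(ch, start, start + length);
            queue.add(t -> t.characters(copy, 0, copy.length));
        } else {
            downstream.characters(ch, start, length);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        flush(); // covers documents with neither </head> nor <body>
        downstream.endDocument();
    }
}
```

The sketch replays events unchanged, so on its own it only delays output. The point is that flush() is the one place where the complete head is known before anything has been emitted, which is where a real implementation could supply the <title> or keep attributes such as lang on <html>.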

-- Ken


On Aug 10, 2010, at 7:53pm, Ken Krugler wrote:

Hi all,

I was trying to debug why my fix for a problem with the Boilerpipe integration wasn't working, and came across XHTMLContentHandler.lazyStartDocument().

This, when used by HtmlHandler, essentially skips calling the user-provided content handler for the initial element tags (html, head, body) until it looks like there's a reason to generate content. Then it calls the content handler with no-attribute versions of these elements, so attributes in elements like <html lang="en"> will get stripped. That doesn't seem great, especially given ongoing work to make it easier to pass everything through if that's what's needed.

But the problem I ran into was with this sequence:

<html>
        <head>
                <title>xxx</title>
                <meta blah>
        </head>
        <body>
        ...
        </body>
</html>

The problem is that the call to lazyStartDocument() is made when the <meta> element is encountered. So what the content handler gets called with is:

<html>
        <head>
                <title>xxx</title>
        </head>
        <body>

and then <meta>

So the <meta> element is getting passed through after the <body> element. And that in turn prevents Boilerpipe from behaving as expected.

But before I dive in here and start filing issues/hacking on the code, I'm wondering if somebody (OK, Jukka) can provide some color commentary.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




