Hi all,
Digging deeper, the current behavior seems to be causing problems that
were not evident in Tika 0.7. We noticed this when switching the Bixo
code to use Tika 0.8-SNAPSHOT.
For example, if you have a document that looks like:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Some Title</title>
</head>
<body>
...
</html>
The lazyStartDocument() method is called when HtmlHandler encounters
the <meta> tag, because HtmlHandler calls xhtml.startElement() for
that tag.
Since this is before <title> has been seen, the output generated has
an empty <title> element. And that causes a bunch of problems for our
tests.
I believe this (and the previous problem I'd reported) is a side
effect of TIKA-379, which Chris M. rolled in during change 949635.
Unfortunately I think lazyStartDocument() needs to be re-thought. A
rough proposal would be:
1. HtmlHandler should call xhtml start/endElement for all elements,
rather than creating a fragile implicit dependency between its behavior
and that of XHTMLContentHandler.
2. In XHTMLContentHandler, the elements received should be queued up
until endElement() is called for <head>, or startElement() is called
for <body>, or endDocument() is called.
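As a rough illustration of point 2, here is a minimal sketch that uses only the plain org.xml.sax interfaces. HeadBufferingHandler is a hypothetical name, not an existing Tika class, and this is not how XHTMLContentHandler is actually written. It only shows the hold-and-release idea: events are queued until </head>, <body>, or endDocument() is seen, then replayed in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not part of Tika: queues SAX events until the head is complete. */
class HeadBufferingHandler extends DefaultHandler {

    /** One deferred SAX event that can be replayed later. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final ContentHandler delegate;
    private final List<Event> pending = new ArrayList<>();
    private boolean flushed = false;

    HeadBufferingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    /** Queue the event, or pass it straight through once the queue has been released. */
    private void emit(Event event) throws SAXException {
        if (flushed) {
            event.replay(delegate);
        } else {
            pending.add(event);
        }
    }

    /** Release all queued events in their original order (only once). */
    private void flush() throws SAXException {
        if (flushed) {
            return;
        }
        flushed = true;
        for (Event event : pending) {
            event.replay(delegate);
        }
        pending.clear();
    }

    /** Fall back to the qualified name when the parser is not namespace-aware. */
    private static String name(String local, String qName) {
        return (local == null || local.isEmpty()) ? qName : local;
    }

    @Override
    public void startDocument() throws SAXException {
        delegate.startDocument();
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if ("body".equals(name(local, qName))) {
            flush(); // <body> starting means the head is done
        }
        // Copy the attributes: parsers may reuse the Attributes object.
        Attributes copy = new AttributesImpl(atts);
        emit(h -> h.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        emit(h -> h.endElement(uri, local, qName));
        if ("head".equals(name(local, qName))) {
            flush(); // </head> seen: release everything queued so far
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Copy the text: the parser owns the original buffer.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        emit(h -> h.characters(copy, 0, copy.length));
    }

    @Override
    public void endDocument() throws SAXException {
        flush(); // documents with no <head> or <body> still get their events
        delegate.endDocument();
    }
}
```

With the whole head in hand at flush time, a real implementation would have the chance to fix up ordering or keep attributes before anything reaches the user's handler. The sketch just replays the events unchanged.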
-- Ken
On Aug 10, 2010, at 7:53pm, Ken Krugler wrote:
Hi all,
I was trying to debug why my fix for a problem with the Boilerpipe
integration wasn't working, and came across
XHTMLContentHandler.lazyStartDocument().
This, when used by HtmlHandler, essentially skips calling the user-
provided content handler for the initial element tags (html, head,
body) until it looks like there's a reason to generate content. Then
it calls the content handler with no-attribute versions of these
elements, so attributes in elements like <html lang="en"> will get
stripped. Which seems like not a great thing, especially given
ongoing work to make it easier to pass everything through if that's
what's needed.
But the problem I ran into was with this sequence:
<html>
<head>
<title>xxx</title>
<meta blah>
</head>
<body>
...
</body>
</html>
The problem is that the call to lazyStartDocument() is made when the
<meta> element is encountered. So what the content handler gets
called with is:
<html>
<head>
<title>xxx</title>
</head>
<body>
and then <meta>
So the <meta> element is getting passed through after the <body>
element. And that in turn prevents Boilerpipe from behaving as
expected.
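To make the ordering concrete, here is a toy model of the behavior described above. It is not Tika's code: LazyStartModel and its methods are made up for illustration, it works on plain strings instead of SAX events, and it assumes that the structural tags and <title> are held back until the first other element arrives.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the lazy-start behavior described above; NOT Tika's actual code. */
class LazyStartModel {
    private final List<String> out = new ArrayList<>();
    private String title = "";       // title text seen so far (assumed to be remembered)
    private boolean started = false;

    /** Feed one element in the order the HTML parser reports it. */
    void element(String name, String text) {
        switch (name) {
            case "html":
            case "head":
            case "body":
                return;              // structural tags are held back
            case "title":
                title = text;        // remembered, emitted by lazyStart()
                return;
            default:
                lazyStart();         // first "real" element triggers the prolog
                out.add("<" + name + ">");
        }
    }

    /** Emit the attribute-free document prolog exactly once. */
    private void lazyStart() {
        if (started) {
            return;
        }
        started = true;
        out.add("<html>");
        out.add("<head>");
        out.add("<title>" + title + "</title>");
        out.add("</head>");
        out.add("<body>");
    }

    List<String> output() {
        return out;
    }
}
```

Feeding it html, head, title ("xxx"), meta in that order yields the prolog followed by <meta>, so <meta> ends up after <body>, which matches the sequence shown above.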
But before I dive in here and start filing issues/hacking on the
code, I'm wondering if somebody (OK, Jukka) can provide some color
commentary.
Thanks,
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g