Hi all,
I was trying to debug why my fix for a problem with the Boilerpipe
integration wasn't working, and came across
XHTMLContentHandler.lazyStartDocument().
When used by HtmlHandler, this essentially skips calling the
user-provided content handler for the initial element tags (html,
head, body) until it looks like there's a reason to generate content.
Then it calls the content handler with no-attribute versions of these
elements, so attributes on elements like <html lang="en"> get
stripped. That doesn't seem like a great thing, especially given the
ongoing work to make it easier to pass everything through if that's
what's needed.
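To make the attribute loss concrete, here's a minimal sketch using only
the JDK's SAX classes. It is not the Tika code: LangRecorder and
LazyWrapper are hypothetical stand-ins for a user-provided handler and
for a wrapper that defers <html> and later emits a bare replacement.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class AttributeLossSketch {

    /** Hypothetical user handler: remembers the lang attribute it sees on <html>. */
    static class LangRecorder extends DefaultHandler {
        String lang = "(not seen)";

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("html".equals(local)) {
                String value = atts.getValue("lang");
                lang = (value == null) ? "(missing)" : value;
            }
        }
    }

    /** Hypothetical lazy wrapper (an assumption, not Tika's implementation). */
    static class LazyWrapper extends DefaultHandler {
        private final LangRecorder delegate;
        private boolean started = false;

        LazyWrapper(LangRecorder delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("html".equals(local)) {
                return; // deferred: the original attributes are dropped here
            }
            if (!started) {
                started = true;
                // Emit a synthesized <html> with an empty attribute set.
                delegate.startElement(uri, "html", "html", new AttributesImpl());
            }
            delegate.startElement(uri, local, qName, atts);
        }
    }

    /** Feeds <html lang="en"> then <p> through the wrapper and reports what the delegate saw. */
    static String langSeenThroughWrapper() {
        AttributesImpl htmlAtts = new AttributesImpl();
        htmlAtts.addAttribute("", "lang", "lang", "CDATA", "en");

        LangRecorder recorder = new LangRecorder();
        LazyWrapper wrapper = new LazyWrapper(recorder);
        wrapper.startElement("", "html", "html", htmlAtts);
        wrapper.startElement("", "p", "p", new AttributesImpl());
        return recorder.lang;
    }

    public static void main(String[] args) {
        System.out.println(langSeenThroughWrapper());
    }
}
```

In this sketch the recorder never sees lang="en", because the only
<html> start it receives is the synthesized one.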
But the problem I ran into was with this sequence:
<html>
<head>
<title>xxx</title>
<meta blah>
</head>
<body>
...
</body>
</html>
The trouble is that the call to lazyStartDocument() is made when the
<meta> element is encountered. So what the content handler gets called
with is:
<html>
<head>
<title>xxx</title>
</head>
<body>
and then <meta>
So the <meta> element is getting passed through after the <body>
element. And that in turn prevents Boilerpipe from behaving as expected.
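Here's a small plain-Java model of that ordering, in case it helps
with the discussion. It's a sketch under assumptions, not the actual
XHTMLContentHandler: the class and method names are made up, and I'm
assuming the title text is captured and replayed inside a synthesized
head, with the first other element triggering the lazy start.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LazyStartOrderSketch {
    private final List<String> out = new ArrayList<>();
    private boolean started = false;
    private boolean inTitle = false;
    private String title = "";

    /** Emits the synthesized document prefix once, on first demand. */
    private void lazyStart() {
        if (started) {
            return;
        }
        started = true;
        out.addAll(Arrays.asList(
                "<html>", "<head>", "<title>" + title + "</title>", "</head>", "<body>"));
    }

    void start(String name) {
        switch (name) {
            case "html":
            case "head":
            case "body":
                return; // structural tags are deferred, never passed through directly
            case "title":
                inTitle = true; // assumption: title text is captured, not forwarded
                return;
            default:
                lazyStart(); // any other element forces the prefix out first
                out.add("<" + name + ">");
        }
    }

    void text(String s) {
        if (inTitle) {
            title += s;
        }
    }

    void end(String name) {
        if ("title".equals(name)) {
            inTitle = false;
        }
    }

    /** Replays the start of the document from the message and returns what was emitted. */
    static List<String> replay() {
        LazyStartOrderSketch h = new LazyStartOrderSketch();
        h.start("html");
        h.start("head");
        h.start("title");
        h.text("xxx");
        h.end("title");
        h.start("meta");
        h.start("body");
        return h.out;
    }

    public static void main(String[] args) {
        System.out.println(replay());
    }
}
```

With this model, <meta> lands after the synthesized <body>, which is
the same shape as the sequence above.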
But before I dive in here and start filing issues/hacking on the code,
I'm wondering if somebody (OK, Jukka) can provide some color commentary.
Thanks,
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g