Sorry Guys.

I'm +1 for Ken's proposal, and for potentially including examples of your Bixo 
tests in the Tika codebase :) Ken, can you attach some of your tests to a new 
JIRA issue for this and link it to TIKA-379?

Cheers,
Chris



On 8/11/10 7:19 PM, "Ken Krugler" <[email protected]> wrote:

Hi all,

Digging deeper, the current behavior seems to be causing problems that
were not evident in Tika 0.7. We noticed this when switching the Bixo
code to use Tika 0.8-SNAPSHOT.

For example, if you have a document that looks like:

<html>
        <head>
                <meta http-equiv="content-type" content="text/html; charset=utf-8">
                <title>Some Title</title>
        </head>
        <body>
        ...
</html>

The lazyStartDocument() method is called when the <meta> tag is
encountered by HtmlHandler, because it calls xhtml.startElement() with
the meta tag.

Since this is before <title> has been seen, the output generated has
an empty <title> element. And that causes a bunch of problems for our
tests.

I believe this (and the previous problem I'd reported) is a side-
effect of TIKA-379, which Chris M. rolled in during change 949635.

Unfortunately I think lazyStartDocument() needs to be re-thought. A
rough proposal would be:

1. HtmlHandler should call xhtml start/endElement for all elements,
versus creating a fragile implicit dependency between its behavior and
that of XHTMLContentHandler.

2. In XHTMLContentHandler, the elements received should be queued up
until endElement() is called for <head>, or startElement() is called
for <body>, or endDocument() is called.
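
As a rough illustration of item 2, here is a minimal sketch of the queueing
idea using only the JDK's SAX classes. The class name HeadBufferingHandler
and its structure are hypothetical, not existing Tika code, and it ignores
things a real XHTMLContentHandler change would also have to deal with
(prefix mappings, ignorable whitespace, rewriting the queued head content).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: queue SAX events until the head section is complete. */
class HeadBufferingHandler extends DefaultHandler {

    /** A deferred SAX event that can be replayed against the downstream handler. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final ContentHandler downstream;
    private final List<Event> queue = new ArrayList<>();
    private boolean flushed = false;

    HeadBufferingHandler(ContentHandler downstream) {
        this.downstream = downstream;
    }

    /** Replays all queued events in document order; later events pass straight through. */
    private void flush() throws SAXException {
        if (!flushed) {
            flushed = true;
            for (Event e : queue) {
                e.replay(downstream);
            }
            queue.clear();
        }
    }

    private void handle(Event e) throws SAXException {
        if (flushed) {
            e.replay(downstream);
        } else {
            queue.add(e);
        }
    }

    @Override
    public void startDocument() throws SAXException {
        handle(ContentHandler::startDocument);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if ("body".equals(local)) {
            flush(); // trigger: startElement() for <body>
        }
        Attributes copy = new AttributesImpl(atts); // the caller may reuse atts
        handle(h -> h.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        handle(h -> h.endElement(uri, local, qName));
        if ("head".equals(local)) {
            flush(); // trigger: endElement() for <head>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        char[] copy = Arrays.copyOfRange(ch, start, start + length); // buffer may be reused
        handle(h -> h.characters(copy, 0, copy.length));
    }

    @Override
    public void endDocument() throws SAXException {
        flush(); // trigger: endDocument() with neither </head> nor <body> seen
        downstream.endDocument();
    }
}
```

With something like this in place, the downstream handler sees nothing until
the whole <head> is known, so a <meta> that arrives before <title> no longer
forces an early, empty <title> to be written.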

-- Ken


On Aug 10, 2010, at 7:53pm, Ken Krugler wrote:

> Hi all,
>
> I was trying to debug why my fix for a problem with the Boilerpipe
> integration wasn't working, and came across
> XHTMLContentHandler.lazyStartDocument().
>
> This, when used by HtmlHandler, essentially skips calling the user-
> provided content handler for the initial element tags (html, head,
> body) until it looks like there's a reason to generate content. Then
> it calls the content handler with no-attribute versions of these
> elements, so attributes in elements like <html lang="en"> will get
> stripped. Which seems like not a great thing, especially given
> ongoing work to make it easier to pass everything through if that's
> what's needed.
>
> But the problem I ran into was with this sequence:
>
> <html>
>       <head>
>               <title>xxx</title>
>               <meta blah>
>       </head>
>       <body>
>       ...
>       </body>
> </html>
>
> The problem is that this call to lazyStartDocument() is made when the
> <meta> element is encountered. So what the content handler gets
> called with is:
>
> <html>
>       <head>
>               <title>xxx</title>
>       </head>
>       <body>
>
> and then <meta>
>
> So the <meta> element is getting passed through after the <body>
> element. And that in turn prevents Boilerpipe from behaving as
> expected.
>
> But before I dive in here and start filing issues/hacking on the
> code, I'm wondering if somebody (OK, Jukka) can provide some color
> commentary.
>
> Thanks,
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
