[jira] [Updated] (TIKA-1134) ContentHandler gets ignorable whitespace for
<br> tags when parsing HTML

Hoss Man (JIRA) Tue, 11 Jun 2013 11:51:28 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hoss Man updated TIKA-1134:
---------------------------

    Attachment: TIKA-1134.patch

FWIW: Changing the XHTMLContentHandler.newline() function to delegate to 
characters(...) seems to solve the problem for <br> tags in HTML, but breaks 
several tests -- probably because the newline() function is also used to 
intentionally add (synthetic) ignorableWhitespace events after elements.
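
A rough sketch of that side effect, using only the JDK's SAX classes. The 
Emitter class and its newlineAsCharacters flag are hypothetical stand-ins, not 
Tika's XHTMLContentHandler, and the event sequence is hand-written on the 
assumption that newline() is called once for the <br> and once (synthetically) 
after the closing element:

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class NewlineDelegationSketch {

    /** Hypothetical emitter for illustration only; NOT Tika's XHTMLContentHandler. */
    static class Emitter {
        private static final char[] NL = {'\n'};
        private final ContentHandler out;
        private final boolean newlineAsCharacters;

        Emitter(ContentHandler out, boolean newlineAsCharacters) {
            this.out = out;
            this.newlineAsCharacters = newlineAsCharacters;
        }

        void text(String s) throws SAXException {
            out.characters(s.toCharArray(), 0, s.length());
        }

        // The one method whose behavior is being changed.
        void newline() throws SAXException {
            if (newlineAsCharacters) {
                out.characters(NL, 0, NL.length);
            } else {
                out.ignorableWhitespace(NL, 0, NL.length);
            }
        }
    }

    /** Character data a client sees for a simulated "<p>X<br>Y</p>". */
    static String characterData(boolean newlineAsCharacters) {
        final StringBuilder seen = new StringBuilder();
        // Client that only listens to characters(); DefaultHandler drops the rest.
        ContentHandler client = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        };
        Emitter xhtml = new Emitter(client, newlineAsCharacters);
        try {
            xhtml.text("X");
            xhtml.newline(); // stands in for the <br>
            xhtml.text("Y");
            xhtml.newline(); // assumed synthetic newline after the closing element
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        System.out.println(characterData(false).equals("XY"));     // newline() as ignorable whitespace
        System.out.println(characterData(true).equals("X\nY\n"));  // newline() as characters
    }
}
```

With the flag on, the <br> becomes visible, but so does the trailing synthetic 
newline, which would plausibly change the expected text in existing tests.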


The attached patch contains some tests that attempt to demonstrate the problem, 
but there are a few important things to note...

 * CustomContentHandlerTest.testBR was my first attempt to demonstrate this 
problem, but for reasons I don't understand my ContentHandler never got any 
character SAX events from the body of the HTML.  
CustomContentHandlerTest.testBRBody was my attempt to sanity check myself using 
the BodyContentHandler, which *also* didn't work in my small isolated test, for 
reasons I don't understand.
 * After I gave up trying to make sense of why CustomContentHandlerTest wasn't 
getting any body content, I instead started hacking away at the existing 
AutoDetectParserTest to see if I could demonstrate the problem there.
 * AutoDetectParserTest.testCustomContentHandler currently fails with a clear 
demonstration of the problem ("Content handler got char sequence 'XY'")
 * AutoDetectParserTest.testBodyContentHandlerBR only passes because ultimately 
BodyContentHandler delegates to ToTextContentHandler which _does_ treat all 
ignorable whitespace as if it's (meaningful) character data ... that seems to 
me like a related bug that has likely been masking this bug for many users who 
just want to extract body text data.
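
To make the contrast between the last two bullets concrete, here is a minimal 
sketch that uses only the JDK's SAX classes. The events are hand-fed to 
approximate what the issue says is produced for "X<br>Y"; nothing here calls 
Tika. StrictTextHandler and LenientTextHandler are hypothetical names, and 
LenientTextHandler is only a simplified stand-in for the ToTextContentHandler 
behavior described above:

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BrWhitespaceDemo {

    /** Collects characters() only; ignorable whitespace is dropped (DefaultHandler's default). */
    static class StrictTextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Treats ignorable whitespace as ordinary character data. */
    static class LenientTextHandler extends StrictTextHandler {
        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            characters(ch, start, length);
        }
    }

    /** Replays hand-written events approximating "X<br>Y" and returns the collected text. */
    static String collect(StrictTextHandler handler) {
        try {
            handler.characters(new char[] {'X'}, 0, 1);
            handler.ignorableWhitespace(new char[] {'\n'}, 0, 1); // the <br>, per the issue
            handler.characters(new char[] {'Y'}, 0, 1);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect(new StrictTextHandler()).equals("XY"));     // run-on text
        System.out.println(collect(new LenientTextHandler()).equals("X\nY"));  // line break kept
    }
}
```

The strict handler sees the run-on "XY", while the lenient one keeps the line 
break, which is why body-text extraction would not show the problem.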





                
> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>
>                 Key: TIKA-1134
>                 URL: https://issues.apache.org/jira/browse/TIKA-1134
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: TIKA-1134.patch
>
>
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding 
> something here, but it appears that the way Tika parses HTML to produce XHTML 
> SAX events is misinterpreting "<br>" tags as equivalent to ignorable 
> whitespace containing a newline.  This means that clients who ask Tika to 
> parse files, and specify their own ContentHandler to capture the character 
> data can get sequences of run-on text w/o knowing that the "<br>" tag was 
> present -- _unless_ they explicitly handle ignorableWhitespace and treat it 
> as "real" whitespace -- but this creates a catch-22 if you really do want to 
> ignore the ignorable whitespace in the HTML markup.
> The crux of the problem seems to be:
>  * instead of generating a startElement event for "br" the HtmlParser treats 
> it as an xhtml.newline().
>  * xhtml.newline() generates an ignorableWhitespace SAX event instead of a 
> characters SAX event
> ...either one of these by themselves might be fine, but in combination they 
> don't really make any sense.  If for example an actual newline exists in the 
> html, it comes across as part of a characters SAX event, not as ignorable 
> whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve 
> the problem for <br> tags in HTML, but breaks several tests -- probably 
> because the newline() function is also used to intentionally add 
> (synthetic) ignorableWhitespace events after elements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
