[ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich closed TIKA-1134.
---------------------------------
Resolution: Won't Fix
Closing as Won't Fix given the above conversation. Thanks all!
> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>
> Key: TIKA-1134
> URL: https://issues.apache.org/jira/browse/TIKA-1134
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Hoss Man
> Attachments: TIKA-1134.patch
>
>
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding
> something here, but it appears that the way Tika parses HTML to produce XHTML
> SAX events is misinterpreting "<br>" tags as equivalent to ignorable
> whitespace containing a newline. This means that clients who ask Tika to
> parse files, and specify their own ContentHandler to capture the character
> data, can get sequences of run-on text w/o knowing that the "<br>" tag was
> present -- _unless_ they explicitly handle ignorableWhitespace and treat it
> as "real" whitespace -- but this creates a catch-22 if you really do want to
> ignore the ignorable whitespace in the HTML markup.
> The crux of the problem seems to be:
> * instead of generating a startElement event for "br" the HtmlParser treats
> it as an xhtml.newline().
> * xhtml.newline() generates an ignorableWhitespace SAX event instead of a
> characters SAX event
> ...either one of these by themselves might be fine, but in combination they
> don't really make any sense. If for example an actual newline exists in the
> html, it comes across as part of a characters SAX event, not as ignorable
> whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve
> the problem for <br> tags in HTML, but breaks several tests -- probably
> because the newline() function is also used to intentionally add
> (synthetic) ignorableWhitespace events after elements.
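
The catch-22 described in the report can be illustrated with a minimal sketch using only the JDK's SAX classes. It does not call Tika: the event sequence for "foo<br>bar" is replayed by hand, based solely on the behavior the reporter describes, and the names BrWhitespaceDemo, TextCollector and replay are hypothetical.

```java
import org.xml.sax.helpers.DefaultHandler;

public class BrWhitespaceDemo {

    /** Collects character data; optionally treats ignorable whitespace as real text. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean keepIgnorable;

        TextCollector(boolean keepIgnorable) {
            this.keepIgnorable = keepIgnorable;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            if (keepIgnorable) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /**
     * Replays, by hand, the events the report describes for "foo<br>bar":
     * the <br> arrives as ignorableWhitespace("\n"), and so does the
     * synthetic newline emitted after an element. This is an assumed
     * sequence for illustration, not output captured from Tika.
     */
    static String replay(boolean keepIgnorable) {
        TextCollector handler = new TextCollector(keepIgnorable);
        char[] foo = "foo".toCharArray();
        char[] bar = "bar".toCharArray();
        char[] newline = {'\n'};

        handler.characters(foo, 0, foo.length);
        handler.ignorableWhitespace(newline, 0, 1); // stands in for <br>
        handler.characters(bar, 0, bar.length);
        handler.ignorableWhitespace(newline, 0, 1); // synthetic newline after an element
        return handler.text();
    }

    public static void main(String[] args) {
        // Ignoring ignorable whitespace loses the <br> and yields run-on text.
        System.out.println(replay(false));
        // Keeping it restores the <br> but also keeps the synthetic newline.
        System.out.println(replay(true).replace("\n", "\\n"));
    }
}
```

In this sketch the handler cannot tell the two ignorableWhitespace events apart, so it has to either drop the line break or keep the synthetic whitespace as well.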