[
https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733819#comment-13733819
]
Hoss Man commented on TIKA-1134:
--------------------------------
The crux of my initial confusion, and of my continued concern that this needs
to be documented so future users avoid the same confusion, comes from
statements like these from Uwe...
bq. ... consumers that are only interested in the plain text contents of parsed
files, should ignore all HTML syntax elements and just treat
ignorableWhitespace as significant
...and over in SOLR-4679, Uwe added...
bq. ... "ignoreable whitespace" is XML semantics only, in (X)HTML this does not
exist (it is handled differently, but is never reported by HTML parsers), so
the idea in TIKA is to "reuse" (its a bit "incorrect") the ignoreableWhitespace
SAX event to report this "added whitespace".
As someone who is not a Tika expert, or an XHTML expert, or even an HTML expert
-- I have no way of knowing any of this information if I'm trying to
build/maintain a custom ContentHandler to parse out specific bits of
information from arbitrary files.
In this specific case, I'm maintaining a ContentHandler used in Tika that
attempts to be very generic and agnostic to the types of files that get parsed
-- so even if I were an HTML expert and understood that "ignorable whitespace"
isn't really an (X)HTML concept, I wouldn't know if/when I should assume that
was relevant in building a custom ContentHandler for Tika. All I have to go on
is the general information that Tika handles the arbitrary file parsing for me
and generates SAX events, combined with the org.xml.sax.ContentHandler
javadocs. Those javadocs might then lead me to the XML spec's explanation of
what ignorableWhitespace is, which leads me to (seemingly reasonably) assume
that if Tika is taking care of parsing file type $foo and mapping that to SAX
events, I probably don't want ignorableWhitespace -- but the truth is I do for
some (all?) file types.
That part isn't clear in any docs I've seen, and should probably be made clear
somewhere.
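For illustration, here is a minimal sketch of what that advice seems to mean
for a custom handler that only wants plain text. It assumes nothing beyond the
standard org.xml.sax API; the class name PlainTextHandler is hypothetical, and
main() only simulates the events described in this issue for "foo<br>bar"
rather than invoking Tika itself.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects plain text and treats ignorableWhitespace
// as significant, following Uwe's comments quoted above.
public class PlainTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Without this override, DefaultHandler silently drops the event
        // and the text on either side of a <br> runs together.
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    public static void main(String[] args) {
        PlainTextHandler handler = new PlainTextHandler();
        // Simulated event sequence for "foo<br>bar" as described in this
        // issue (an assumption for the demo, not captured Tika output).
        handler.characters("foo".toCharArray(), 0, 3);
        handler.ignorableWhitespace("\n".toCharArray(), 0, 1);
        handler.characters("bar".toCharArray(), 0, 3);
        System.out.println(handler.getText());
    }
}
```

The catch-22 from the issue description still applies: a handler written this
way cannot tell synthetic whitespace from whitespace it might really want to
ignore.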
> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>
> Key: TIKA-1134
> URL: https://issues.apache.org/jira/browse/TIKA-1134
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Hoss Man
> Attachments: TIKA-1134.patch
>
>
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding
> something here, but it appears that the way Tika parses HTML to produce XHTML
> SAX events is misinterpreting "<br>" tags as equivalent to ignorable
> whitespace containing a newline. This means that clients who ask Tika to
> parse files, and specify their own ContentHandler to capture the character
> data, can get sequences of run-on text w/o knowing that the "<br>" tag was
> present -- _unless_ they explicitly handle ignorableWhitespace and treat it
> as "real" whitespace -- but this creates a catch-22 if you really do want to
> ignore the ignorable whitespace in the HTML markup.
> The crux of the problem seems to be:
> * instead of generating a startElement event for "br" the HtmlParser treats
> it as a xhtml.newline().
> * xhtml.newline() generates an ignorableWhitespace SAX event instead of a
> characters SAX event
> ...either one of these by themselves might be fine, but in combination they
> don't really make any sense. If for example an actual newline exists in the
> HTML, it comes across as part of a characters SAX event, not as ignorable
> whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve
> the problem for <br> tags in HTML, but breaks several tests -- probably
> because the newline() function is also used to intentionally add
> (synthetic) ignorableWhitespace events after elements.