[ 
https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733819#comment-13733819
 ] 

Hoss Man commented on TIKA-1134:
--------------------------------

The crux of my initial confusion, and of my continued concern that this needs 
to be documented so future users avoid the same confusion, comes from these 
types of statements Uwe made...

bq. ... consumers that are only interested in the plain text contents of parsed 
files, should ignore all HTML syntax elements and just treat 
ignorableWhitespace as significant

...and over in SOLR-4679, Uwe added...

bq. ... "ignoreable whitespace" is XML semantics only, in (X)HTML this does not 
exist (it is handled differently, but is never reported by HTML parsers), so 
the idea in TIKA is to "reuse" (its a bit "incorrect") the ignoreableWhitespace 
SAX event to report this "added whitespace". 

As someone who is not a Tika expert, or an XHTML expert, or even an HTML expert 
-- I have no way of knowing any of this information if I'm trying to 
build/maintain a custom ContentHandler to parse out specific bits of 
information from arbitrary files.  

In this specific case, I'm maintaining a ContentHandler used in Tika that 
attempts to be very generic and agnostic to the types of files that get parsed. 
So even if I were an HTML expert and understood that "ignorable whitespace" 
isn't really an (X)HTML concept, I wouldn't know if/when I should assume that 
was relevant in building a custom ContentHandler for Tika. All I have to go on 
is the general information that Tika handles the arbitrary file parsing for me 
and generates SAX events, combined with the org.xml.sax.ContentHandler 
javadocs. Those might then lead me to the XML spec's explanation of what 
ignorableWhitespace is, which leads me to (seemingly reasonably) assume that if 
Tika is taking care of parsing file type $foo and mapping that to SAX events, I 
probably don't want ignorableWhitespace -- but the truth is I do for some 
(all?) file types.

That part isn't clear in any docs I've seen, and should probably be made clear 
somewhere.
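
To make the point concrete, here is a minimal sketch of a text-collecting 
handler that treats ignorableWhitespace as significant, which is what the 
statements quoted above recommend. PlainTextHandler and demo() are hypothetical 
names, not part of Tika, and demo() only simulates the event sequence this 
issue describes for "line one<br>line two" rather than calling a real parser.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects plain text and keeps
// ignorableWhitespace events instead of discarding them.
class PlainTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Per the discussion above, Tika reuses this event for whitespace it
        // adds itself (e.g. the newline for <br>), so append it as real text.
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }

    // Simulated event sequence (an assumption based on this issue's
    // description): characters, then ignorableWhitespace for <br>, then
    // characters again.
    static String demo() {
        PlainTextHandler handler = new PlainTextHandler();
        char[] first = "line one".toCharArray();
        handler.characters(first, 0, first.length);
        char[] newline = {'\n'};
        handler.ignorableWhitespace(newline, 0, newline.length);
        char[] second = "line two".toCharArray();
        handler.characters(second, 0, second.length);
        return handler.getText();
    }
}
```

A handler that left ignorableWhitespace as the DefaultHandler no-op would 
instead produce the run-on text "line oneline two" for the same events.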
                
> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>
>                 Key: TIKA-1134
>                 URL: https://issues.apache.org/jira/browse/TIKA-1134
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Hoss Man
>         Attachments: TIKA-1134.patch
>
>
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding 
> something here, but it appears that the way Tika parses HTML to produce XHTML 
> SAX events is misinterpreting "<br>" tags as equivalent to ignorable 
> whitespace containing a newline.  This means that clients who ask Tika to 
> parse files, and specify their own ContentHandler to capture the character 
> data, can get sequences of run-on text w/o knowing that the "<br>" tag was 
> present -- _unless_ they explicitly handle ignorableWhitespace and treat it 
> as "real" whitespace -- but this creates a catch-22 if you really do want to 
> ignore the ignorable whitespace in the HTML markup.
> The crux of the problem seems to be:
>  * instead of generating a startElement event for "br" the HtmlParser treats 
> it as a xhtml.newline().
>  * xhtml.newline() generates an ignorableWhitespace SAX event instead of a 
> characters SAX event
> ...either one of these by themselves might be fine, but in combination they 
> don't really make any sense.  If for example an actual newline exists in the 
> HTML, it comes across as part of a characters SAX event, not as ignorable 
> whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve 
> the problem for <br> tags in HTML, but breaks several tests -- probably 
> because the newline() function is also used to intentionally add 
> (synthetic) ignorableWhitespace events after elements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
