[ 
https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2928.
-------------------------------
    Fix Version/s: 3.0.0-BETA
       Resolution: Fixed

I'm assuming this is fixed with our migration to JSoup on the main/3.x branch.  
Please re-open if it is not fixed.
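
For anyone who wants to check the expected output without Tika, here is a 
minimal sketch of the behaviour the reporter describes. It assumes the HTML 
tokenizer rule that a '<' only opens markup when it is followed by an ASCII 
letter, '/', '!' or '?', and is otherwise plain text. The class and method 
names are hypothetical; this is not Tika's or JSoup's code, and it does not 
handle comments, scripts, entities or '>' inside attribute values.

```java
final class LessThanTextSketch {

    // Assumption: a '<' opens markup only when the next character is an
    // ASCII letter, '/', '!' or '?'. Anything else (e.g. "<60") is text.
    static boolean opensMarkup(String html, int i) {
        if (i + 1 >= html.length()) {
            return false;
        }
        char next = html.charAt(i + 1);
        return (next >= 'a' && next <= 'z') || (next >= 'A' && next <= 'Z')
                || next == '/' || next == '!' || next == '?';
    }

    // Strips tags and returns the remaining text, one non-empty line per
    // text run. Simplification: a tag ends at the first '>' after its '<'.
    static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<' && opensMarkup(html, i)) {
                int close = html.indexOf('>', i);
                if (close < 0) {
                    break; // unterminated tag: drop the remainder
                }
                out.append('\n'); // treat every tag as a line break
                i = close + 1;
            } else {
                out.append(c); // includes any '<' that does not open markup
                i++;
            }
        }
        // Trim each line and drop the empty ones left behind by adjacent tags.
        StringBuilder cleaned = new StringBuilder();
        for (String line : out.toString().split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                cleaned.append(trimmed).append('\n');
            }
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        // The snippet from the issue description, joined onto one line.
        String html = "<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure "
                + "</td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>";
        System.out.print(extractText(html));
    }
}
```

Run against the snippet from the issue, this prints the full 
"GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure" line followed by 
"ENZYMES & BILIRUBIN", which is the output the reporter was expecting.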

> Less than sign within tag boundaries considered as start of a new tag.
> ----------------------------------------------------------------------
>
>                 Key: TIKA-2928
>                 URL: https://issues.apache.org/jira/browse/TIKA-2928
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser, server
>    Affects Versions: 1.22
>            Reporter: Desmond David
>            Priority: Minor
>             Fix For: 3.0.0-BETA
>
>
> I have been parsing some (somewhat non-standard) HTML documents with Tika 
> and have observed that if an element's text content contains a less-than 
> sign (<), Tika treats it as the start of a new tag and omits the rest of 
> that text from the extracted output, up to the point where the next line 
> break would be emitted.
> For example, consider the following HTML snippet:
>  
> {code:html}
> <tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure 
> </td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>{code}
> The result is:
> {code:java}
> GFR
> ENZYMES & BILIRUBIN
> {code}
> Here, the rest of the content after the first `GFR` is omitted. Based on 
> this observation, I think the `<60` and its subsequent characters are being 
> interpreted as part of a tag and are therefore ignored. Then, at some point, 
> `</td></tr>` is encountered, which cuts that off, and processing resumes 
> with the next line.
> This behaviour was observed with both the Tika App and the Tika Server.
> I think the expected behaviour is that all text within data tags (p, td, 
> etc.) is treated as raw text, or at least that Tika can be configured to 
> treat it that way.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
