[
https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-2928.
-------------------------------
Fix Version/s: 3.0.0-BETA
Resolution: Fixed
I'm assuming this is fixed with our migration to JSoup on the main/3.x branch.
Please re-open if it is not fixed.
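For anyone re-checking this, the behaviour requested in the issue quoted below can be sketched roughly as follows. This is only an illustration under stated assumptions, not Tika or JSoup code: `extract_text` and `TAG` are hypothetical names, and the rule used (a `<` opens a tag only when followed by a letter, `/`, `!` or `?`) is a simplified approximation of the HTML5 tokenizer's tag-open handling.

```python
import re

# Hypothetical helper for illustration only; not Tika or JSoup code.
# Assumption: "<" opens a tag only when followed by a letter, "/", "!" or "?".
# Any other "<" (for example in "GFR<60") is kept as literal text.
# Simplification: a tag is assumed to end at the next ">".
TAG = re.compile(r"<[A-Za-z/!?][^>]*>")


def extract_text(html: str) -> list[str]:
    """Return the non-empty text runs between tags, keeping stray '<' characters."""
    runs = (run.strip() for run in TAG.split(html))
    return [run for run in runs if run]


# The snippet from the issue description.
snippet = (
    "<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure\n"
    "</td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>"
)

for line in extract_text(snippet):
    print(line)
```

With this rule the `GFR<60 ...` cell should come through intact instead of being cut off after `GFR`. A real HTML parser handles many more cases (attributes containing `>`, comments, entities), so treat this only as a way to reason about the expected output.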
> Less than sign within tag boundaries considered as start of a new tag.
> ----------------------------------------------------------------------
>
> Key: TIKA-2928
> URL: https://issues.apache.org/jira/browse/TIKA-2928
> Project: Tika
> Issue Type: Improvement
> Components: parser, server
> Affects Versions: 1.22
> Reporter: Desmond David
> Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> I have been attempting to parse some (somewhat non-standard) HTML
> documents using Tika, and I have observed that if a document contains a
> less-than sign (<) inside a tag's body, Tika parses it as the start of a
> new tag and omits the rest of the text from the final document, up to
> the point where the next line break would be emitted.
> For example, consider the following HTML snippet:
>
> {code:html}
> <tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure
> </td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>{code}
> The result is:
> {code:java}
> GFR
> ENZYMES & BILIRUBIN
> {code}
> Here, the rest of the content after the first `GFR` is omitted. Based on
> this observation, I think the `<60` and its subsequent characters are
> being interpreted as part of a tag and are therefore ignored. Then, at
> some point, `</td></tr>` is encountered, which short-circuits the
> execution and starts processing the next line.
> This behaviour was observed using both the Tika App and the Tika Server.
> I think the expected behaviour is that all text within data tags (p, td,
> etc.) should be treated as raw text, or at least that Tika should be
> configurable to do so.
>