[
https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914035#comment-16914035
]
Desmond David edited comment on TIKA-2928 at 8/23/19 10:50 AM:
---------------------------------------------------------------
Ok, I tested this out with Jsoup and it appears that Jsoup handles this
correctly:
{code:java}
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
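public class Test {
    public static void main(String[] args) {
        // A bare '<' inside the cell text, as in the reported document
        String str = "<tr ><td > GFR<60 = Chronic Kidney Disease, "
                + "GFR<15 = Kidney Failure </td></tr>";
        Document doc = Jsoup.parse(str);
        // Element 0 is the document itself, so text() covers all of its text
        Element e = doc.getAllElements().get(0);
        System.out.println(e.text());
    }
}{code}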
Outputs
{code:java}
GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure{code}
Which is as expected.
Edit:
It appears that, by default, Jsoup treats a less-than sign that does not start
a valid tag as literal text when it parses the string, and escapes it.
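A possible workaround, sketched here and not verified against Tika, would be to escape such stray less-than signs before the document reaches the parser. The class and method names below are hypothetical. The sketch assumes that a `<` only starts markup when it is followed by an ASCII letter, `/`, `!` or `?`, and it does not special-case script or style blocks, comments, or attribute values.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or Jsoup.
final class StrayLessThanEscaper {

    // A '<' that is NOT followed by a letter, '/', '!' or '?' cannot open
    // a tag, end tag, comment or declaration, so it is treated as text.
    private static final Pattern STRAY_LT = Pattern.compile("<(?![A-Za-z/!?])");

    private StrayLessThanEscaper() {
    }

    // Replaces every stray '<' with the entity "&lt;" and leaves real tags alone.
    static String escapeStrayLessThan(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String str = "<tr ><td > GFR<60 = Chronic Kidney Disease, "
                + "GFR<15 = Kidney Failure </td></tr>";
        System.out.println(escapeStrayLessThan(str));
    }
}
```

With the snippet from this issue, `GFR<60` and `GFR<15` become `GFR&lt;60` and `GFR&lt;15`, while `<tr >`, `<td >`, `</td>` and `</tr>` are left unchanged.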
> Less than sign within tag boundaries considered as start of a new tag.
> ----------------------------------------------------------------------
>
> Key: TIKA-2928
> URL: https://issues.apache.org/jira/browse/TIKA-2928
> Project: Tika
> Issue Type: Improvement
> Components: parser, server
> Affects Versions: 1.22
> Reporter: Desmond David
> Priority: Minor
>
> I have been attempting to parse some (somewhat non-standard) HTML
> documents using Tika, and I have observed that if a document contains a
> less-than sign (<) in a tag's body, Tika parses it as the start of a new
> tag and omits the rest of the text from the final document, up to the
> point where the next newline would be inserted.
> For example, consider the following HTML snippet:
>
> {code:html}
> <tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure
> </td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>{code}
> The result is:
> {code:java}
> GFR
> ENZYMES & BILIRUBIN
> {code}
> Here, the rest of the content after the first `GFR` gets omitted. Based on
> this observation, I think the `<60` and its subsequent characters are being
> interpreted as part of a tag and are therefore ignored. Then at some point
> `</td></tr>` is encountered, which cuts this short, and processing continues
> with the next line.
> This behaviour was observed using both the Tika App and the Tika Server.
> I think the expected behaviour is that all text within data tags (p, td,
> etc.) is treated as raw text, or at least that Tika can be configured to
> do so.
>
--
This message was sent by Atlassian Jira
(v8.3.2#803003)