[jira] [Comment Edited] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

Desmond David (Jira) Fri, 23 Aug 2019 03:51:11 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914035#comment-16914035
 ]


Desmond David edited comment on TIKA-2928 at 8/23/19 10:50 AM:
---------------------------------------------------------------

Ok, I tested this out with Jsoup and it appears that Jsoup handles this 
correctly:
{code:java}
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Test {
    public static void main(String[] args) {
        // Sample row with bare '<' characters inside the cell text.
        String str = "<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure </td></tr>";
        Document doc = Jsoup.parse(str);
        // The first element returned is the document root, so text() covers the whole input.
        Element e = doc.getAllElements().get(0);
        System.out.println(e.text());
    }
}{code}
Outputs
{code:java}
GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure{code}
Which is as expected.

Edit:

It appears that Jsoup escapes characters that are not valid HTML markup (such 
as a stray <) by default when it parses the string.
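
For illustration, here is a minimal sketch of the same idea as a pre-processing step in plain Java, with no Jsoup dependency. It assumes the usual HTML tokenizer rule that a < only opens markup when it is followed by a letter, /, ! or ?, and escapes every other < as &lt; before the markup is handed to a parser. The class and method names are hypothetical and are not part of Jsoup or Tika.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Jsoup or Tika.
final class StrayLessThan {
    // Matches a '<' that is NOT followed by a letter, '/', '!' or '?'.
    // Such a '<' cannot open a start tag, end tag, comment or declaration,
    // so it is assumed here to be literal text.
    private static final Pattern STRAY_LT = Pattern.compile("<(?![A-Za-z/!?])");

    // Returns the input with every stray '<' replaced by the entity "&lt;".
    static String escape(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String str = "<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure </td></tr>";
        // The real tags are left alone; only "<60" and "<15" are escaped.
        System.out.println(escape(str));
    }
}
```

This is only a rough workaround: it does not special-case < inside script or style blocks, comments or attribute values, and text such as "a<b" (a < followed directly by a letter) would still look like the start of a tag.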


was (Author: sargent_d):
Ok, I tested this out with Jsoup and it appears that Jsoup handles this 
correctly:
{code:java}
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Test {
        public static void main(String[] args) {
                String str = "<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure </td></tr>";
                Document doc = Jsoup.parse(str);
                Element e = doc.getAllElements().get(0);
                System.out.println(e.text());
        }
}{code}
Outputs
{code:java}
GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure{code}
Which is as expected.

> Less than sign within tag boundaries considered as start of a new tag.
> ----------------------------------------------------------------------
>
>                 Key: TIKA-2928
>                 URL: https://issues.apache.org/jira/browse/TIKA-2928
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser, server
>    Affects Versions: 1.22
>            Reporter: Desmond David
>            Priority: Minor
>
> I have been attempting to parse some (somewhat non-standard) HTML 
> documents using Tika, and I have observed that if the document contains a 
> less-than sign (<) as part of a tag's body, Tika parses it as the start of a 
> new tag and omits the rest of the text from the final output, up to the 
> point where the next newline would be inserted.
> For example, consider the following HTML snippet:
>  
> {code:html}
> <tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure 
> </td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>{code}
> The result is:
> {code:java}
> GFR
> ENZYMES & BILIRUBIN
> {code}
> Here, the rest of the content after the first `GFR` is omitted. Based on 
> this observation, I think that `<60` and its subsequent characters are being 
> interpreted as part of a tag and are therefore ignored. Then, at some point, 
> `</td></tr>` is encountered, which short-circuits the execution, and 
> processing moves on to the next line.
> This behaviour was observed using both the Tika App and the Tika Server.
> I think the expected behaviour is that all text within data tags (p, td, 
> etc.) should be treated as raw text, or at least that Tika should be 
> configurable to do so.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
