[jira] [Updated] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.

Desmond David (Jira) Thu, 22 Aug 2019 04:20:28 -0700


     [ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Desmond David updated TIKA-2928:
--------------------------------
    Description: 
I have been parsing some (somewhat non-standard) HTML documents with Tika and 
have observed that if the document contains a less-than sign (<) inside a 
tag's body, i.e. in the text content of an element, Tika parses it as the 
start of a new tag and omits the text that follows from the final document, up 
to the point where the next line break would be emitted.

For example, consider the following HTML snippet:

 
{code:html}
<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure 
</td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>{code}
The result is:
{code:java}
GFR
ENZYMES & BILIRUBIN
{code}
Here, everything after the first `GFR` is omitted. Based on this observation, 
I think `<60` and the characters that follow it are being interpreted as part 
of a tag and are therefore ignored. Then, at some point, `</td></tr>` is 
encountered, which short-circuits that and moves processing on to the next 
line.

This behaviour was observed with both the Tika App and the Tika Server.

I think the expected behaviour is that all text within data tags (p, td, etc.) 
is treated as raw text, or at least that Tika can be configured to behave that 
way.

 

  was:
I have been parsing some (somewhat non-standard) HTML documents with Tika and 
have observed that if the document contains a less-than sign (<) inside a 
tag's body, i.e. in the text content of an element, Tika parses it as the 
start of a new tag and omits the text that follows from the final document, up 
to the point where the next line break would be emitted.

For example, consider the following HTML snippet:

 
{code:html}
<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure 
</td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>{code}
The result is:
{code:java}
GFR
ENZYMES & BILIRUBIN
{code}
Here, everything after the first `GFR` is omitted. Based on this observation, 
I think `<60` and the characters that follow it are being interpreted as part 
of a tag and are therefore ignored. Then, at some point, `</tr></td>` is 
encountered, which short-circuits that and moves processing on to the next 
line.

This behaviour was observed with both the Tika App and the Tika Server.

I think the expected behaviour is that all text within data tags (p, td, etc.) 
is treated as raw text, or at least that Tika can be configured to behave that 
way.

 


> Less than sign within tag boundaries considered as start of a new tag.
> ----------------------------------------------------------------------
>
>                 Key: TIKA-2928
>                 URL: https://issues.apache.org/jira/browse/TIKA-2928
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, server
>    Affects Versions: 1.22
>            Reporter: Desmond David
>            Priority: Major
>
> I have been parsing some (somewhat non-standard) HTML documents with Tika 
> and have observed that if the document contains a less-than sign (<) inside 
> a tag's body, i.e. in the text content of an element, Tika parses it as the 
> start of a new tag and omits the text that follows from the final document, 
> up to the point where the next line break would be emitted.
> For example, consider the following HTML snippet:
>  
> {code:html}
> <tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure 
> </td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>{code}
> The result is:
> {code:java}
> GFR
> ENZYMES & BILIRUBIN
> {code}
> Here, everything after the first `GFR` is omitted. Based on this 
> observation, I think `<60` and the characters that follow it are being 
> interpreted as part of a tag and are therefore ignored. Then, at some point, 
> `</td></tr>` is encountered, which short-circuits that and moves processing 
> on to the next line.
> This behaviour was observed with both the Tika App and the Tika Server.
> I think the expected behaviour is that all text within data tags (p, td, 
> etc.) is treated as raw text, or at least that Tika can be configured to 
> behave that way.
>  
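
The request above, that a `<` inside element text be kept as text, can be approximated before the document ever reaches the parser. The following is only a sketch of one possible workaround, not something Tika provides: the class `StrayLessThanEscaper` and its `escape` method are hypothetical names, and the rule it applies (a `<` starts markup only when followed by a letter, `/`, `!` or `?`) is a simplifying assumption.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: escapes "<" characters that cannot
// begin a tag, so that a lenient HTML parser sees them as plain text.
class StrayLessThanEscaper {

    // Assumption: a "<" is markup only when the next character is a letter,
    // "/", "!" or "?" (start tag, end tag, comment/doctype, processing
    // instruction). Any other "<", including one at end of input, is escaped.
    private static final Pattern STRAY_LT = Pattern.compile("<(?![A-Za-z/!?])");

    static String escape(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String row = "<tr ><td > GFR<60 = Chronic Kidney Disease, "
                + "GFR<15 = Kidney Failure </td></tr>";
        // The escaped string is what would then be handed to the HTML parser.
        System.out.println(escape(row));
    }
}
```

For the row from the report, this turns `GFR<60` and `GFR<15` into `GFR&lt;60` and `GFR&lt;15` and leaves the `<tr >`, `<td >`, `</td>` and `</tr>` tags untouched. It does not help when a letter follows the sign (for example `a<b`), and it makes no attempt to skip script blocks, comments or attribute values, so it is a stopgap rather than a fix for the parser behaviour itself.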



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
