[ 
https://issues.apache.org/jira/browse/TIKA-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088929#comment-17088929
 ] 

Claas Aug. edited comment on TIKA-3024 at 4/21/20, 6:19 PM:
------------------------------------------------------------

I reproduced the issue with a simple Word (or OpenDocument) file consisting of 
just one word:

               o*n*e

 This is *actually* parsed as follows:

{{       <p>o}}
 {{           <b>n</b>e}}
 {{       </p>}}

This is *not correct*, because the whitespace between {{o}} and {{<b>}} 
is rendered as a space when the output is opened in a browser.

The *expected* result should be as follows:

 {{<p>o<b>n</b>e</p>}}

A *workaround* seems to be to remove these line breaks, together with the 
indentation that follows them, by replacing every match of the following 
regular expression with an empty string: {{/\n +/}}
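
As a minimal sketch of that workaround in Java, assuming the parser output is already available as a String: the class and method names below are hypothetical and not part of Tika, and this blanket replacement would also strip line breaks in content where they are intentional (for example inside {{<pre>}}).

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Tika API.
class IndentationWorkaround {
    // Java form of the regex /\n +/: a newline followed by one or more spaces.
    private static final Pattern LINE_BREAK_INDENT = Pattern.compile("\\n +");

    // Removes each line break together with the indentation that follows it.
    static String stripIndentation(String xhtml) {
        return LINE_BREAK_INDENT.matcher(xhtml).replaceAll("");
    }

    public static void main(String[] args) {
        // Same shape as the parsed output shown above.
        String parsed = "<p>o\n      <b>n</b>e\n  </p>";
        System.out.println(stripIndentation(parsed)); // prints <p>o<b>n</b>e</p>
    }
}
```

A newline that is not followed by a space is left untouched, because the pattern requires at least one space after the line break.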

 PS: I have attached the file [^one.odt] (and the corresponding parser result 
[^one.odt-parsed.html]) with several examples.



> Extra whitespace appended within a tag element's text
> -----------------------------------------------------
>
>                 Key: TIKA-3024
>                 URL: https://issues.apache.org/jira/browse/TIKA-3024
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16, 1.20
>            Reporter: Vivek 
>            Priority: Major
>         Attachments: one.odt, one.odt-parsed.html
>
>
> Website: [http://www.thevanitycase.com/about-us.php]
> While parsing the content of the page, the Tika parser splits the text 
> inside a tag and sends the pieces to crawler4j for content handling, even 
> though the text is contained within a single tag (a span tag). The content 
> handler appends extra whitespace ("  ") to each piece, as it normally does 
> for any text it receives, so a space ends up in the middle of the text.
> Text: "Tel: +91-22-61801700".
> That is,
>  Expected text: "<text before this>Tel: +91-22-61801700<text after this>"
> Actual text: "<text before this>Tel: +91-22-6180170  0<text after this>"
> The JS path of the element: body > div > div:nth-child(6) > div > 
> div.footer-full.footer-btm > div > p > span



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
