[
https://issues.apache.org/jira/browse/TIKA-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088929#comment-17088929
]
Claas Aug. edited comment on TIKA-3024 at 4/21/20, 6:19 PM:
------------------------------------------------------------
I reproduced the issue with a simple Word (or OpenDocument) file consisting of
just one word, with the middle letter in bold:
o*n*e
This is *actually* parsed as follows:
{{ <p>o}}
{{ <b>n</b>e}}
{{ </p>}}
This is *not correct*, because the line break and indentation between {{o}}
and {{<b>}} are rendered as a space when the output is opened in a browser.
The *expected* result should be as follows:
{{<p>o<b>n</b>e</p>}}
A *workaround* seems to be to remove these line-breaks with the following
regular expression: {{/\n +/}}
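The workaround could be applied as a post-processing step on the parser output. A minimal Java sketch, assuming the XHTML is already available as a string; the class and method names are hypothetical and not part of Tika:

```java
import java.util.regex.Pattern;

public class TikaWhitespaceWorkaround {
    // Same pattern as the workaround: a line break followed by one or more spaces.
    private static final Pattern LINE_BREAK_INDENT = Pattern.compile("\\n +");

    // Hypothetical helper: removes every indented line break from the parsed XHTML.
    static String stripLineBreakIndent(String html) {
        return LINE_BREAK_INDENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Parser output for the one-word test document, as shown above.
        String parsed = "<p>o\n <b>n</b>e\n </p>";
        System.out.println(stripLineBreakIndent(parsed)); // prints <p>o<b>n</b>e</p>
    }
}
```

Note that this also removes indented line breaks where the whitespace is meaningful (for example inside {{<pre>}}), so it is a stopgap rather than a fix.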
PS: I have attached the file [^one.odt] (and the corresponding parser result
[^one.odt-parsed.html]) with several examples.
> Extra whitespace appended within a tag element's text
> -----------------------------------------------------
>
> Key: TIKA-3024
> URL: https://issues.apache.org/jira/browse/TIKA-3024
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16, 1.20
> Reporter: Vivek
> Priority: Major
> Attachments: one.odt, one.odt-parsed.html
>
>
> Website: [http://www.thevanitycase.com/about-us.php]
> When Tika parses this page, it splits the text of one element into several
> chunks and passes them to crawler4j for content handling, even though the
> text is contained within a single tag (a span tag). The content handler
> appends extra whitespace (" ") to each chunk, as it normally does for any
> text it receives, so a space ends up inside the text.
> Text: "Tel: +91-22-61801700".
> That is,
> Expected text: "<text before this>Tel: +91-22-61801700<text after this>"
> Actual text: "<text before this>Tel: +91-22-6180170 0<text after this>"
> The JS path of the element: body > div > div:nth-child(6) > div >
> div.footer-full.footer-btm > div > p > span
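The effect described in the issue can be illustrated with a plain SAX handler. This is only a sketch under assumptions: {{TextCollector}} and {{collect}} are hypothetical names, the handler is not crawler4j's actual code, and the split point before the final {{0}} is taken from the "Actual text" above. It shows that a handler which appends a space to every {{characters()}} chunk breaks text that the parser delivers in more than one call, while a handler that only concatenates does not.

```java
import org.xml.sax.helpers.DefaultHandler;

public class ChunkedTextDemo {
    // Hypothetical handler: collects text, optionally adding a space after each chunk.
    static class TextCollector extends DefaultHandler {
        private final boolean spaceAfterChunk;
        private final StringBuilder text = new StringBuilder();

        TextCollector(boolean spaceAfterChunk) {
            this.spaceAfterChunk = spaceAfterChunk;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
            if (spaceAfterChunk) {
                text.append(' ');
            }
        }

        String text() {
            return text.toString();
        }
    }

    // Feeds the chunks to a collector, one characters() call per chunk.
    static String collect(boolean spaceAfterChunk, String... chunks) {
        TextCollector collector = new TextCollector(spaceAfterChunk);
        for (String chunk : chunks) {
            char[] chars = chunk.toCharArray();
            collector.characters(chars, 0, chars.length);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        // The span text arrives in two chunks, as assumed from the report.
        System.out.println(collect(true, "Tel: +91-22-6180170", "0"));
        System.out.println(collect(false, "Tel: +91-22-6180170", "0"));
    }
}
```

With the space-appending variant the number comes out as {{6180170 0}}; the concatenating variant yields {{Tel: +91-22-61801700}}. A handler that needs separators between elements could add them in {{endElement}} instead of after every text chunk.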
--
This message was sent by Atlassian Jira
(v8.3.4#803005)