[jira] [Commented] (TIKA-3024) Extra whitespace appended within a tag element's text

Claas Aug. (Jira) Tue, 21 Apr 2020 11:10:15 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088929#comment-17088929
 ]


Claas Aug. commented on TIKA-3024:
----------------------------------

I reproduced the issue with a simple Word (or OpenDocument) file consisting of 
just one word:

 > o*n*e

 This is actually parsed as follows:

```
        <p>o
        <b>n</b>e
        </p>

```

This is not correct: the whitespace between `o` and `<b>` is rendered as a 
space when the output is opened in a browser.

The expected result should be as follows:

 
 ```
 <p>o<b>n</b>e
 </p>

```

Note that the case above can currently still be distinguished from the 
following one (with a space after the "o"):

> o *n*e

That is, the parser result is almost the same, except that there is an 
additional space between the "o" and the line feed.

Removing whitespace with this regex seems to do the trick: `/\n +/`
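
As a rough sketch of that workaround in Java, applied to the serialized output as a plain string (the class and method names are made up for illustration and are not part of Tika):

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: post-processes serialized XHTML as a plain string.
final class IndentationStripper {
    // A line feed followed by one or more spaces, i.e. the /\n +/ regex from the comment.
    private static final Pattern LINE_FEED_AND_INDENT = Pattern.compile("\\n +");

    // Removes every line feed together with the indentation that follows it.
    static String strip(String xhtml) {
        return LINE_FEED_AND_INDENT.matcher(xhtml).replaceAll("");
    }

    public static void main(String[] args) {
        String noSpace = "<p>o\n        <b>n</b>e\n        </p>";
        String withSpace = "<p>o \n        <b>n</b>e\n        </p>";
        System.out.println(strip(noSpace));   // <p>o<b>n</b>e</p>
        System.out.println(strip(withSpace)); // <p>o <b>n</b>e</p>
    }
}
```

The space that belongs to the text survives because it sits before the line feed, so the two cases stay distinguishable. This only matches spaces (not tabs), leaves a line feed with no indentation after it untouched, and would also alter text whose own content contains a line feed followed by spaces, so it is a stopgap rather than a fix in the parser.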

 

> Extra whitespace appended within a tag element's text
> -----------------------------------------------------
>
>                 Key: TIKA-3024
>                 URL: https://issues.apache.org/jira/browse/TIKA-3024
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16, 1.20
>            Reporter: Vivek 
>            Priority: Major
>
> Website: [http://www.thevanitycase.com/about-us.php]
> While parsing the content of the page, the Tika parser splits the text in 
> the tag and sends it to crawler4j for content handling, even though the text 
> is contained within a single tag (a span tag). The content handler then 
> appends extra whitespace ("  "), as it normally does for any text it receives.
> Text: "Tel: +91-22-61801700". 
>  That is, 
>  Expected text: "<text before this>Tel: +91-22-61801700<text after this>"
> Actual text: "<text before this>Tel: +91-22-6180170  0<text after this>"
> The JS path of the element: body > div > div:nth-child(6) > div > 
> div.footer-full.footer-btm > div > p > span



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
