[ 
https://issues.apache.org/jira/browse/TIKA-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Claas Aug. updated TIKA-3024:
-----------------------------
    Attachment: one.odt
                one.odt-parsed.html

> Extra whitespace appended within a tag element's text
> -----------------------------------------------------
>
>                 Key: TIKA-3024
>                 URL: https://issues.apache.org/jira/browse/TIKA-3024
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16, 1.20
>            Reporter: Vivek 
>            Priority: Major
>         Attachments: one.odt, one.odt-parsed.html
>
>
> Website: [http://www.thevanitycase.com/about-us.php]
> When Tika parses the content of this page, it splits the text of a single 
> tag (a span tag) into several chunks and sends them to crawler4j for 
> content handling. The content handler appends extra whitespace ("  ") 
> after each chunk, as it normally does for any text it receives, so the 
> whitespace ends up inside the text of the tag.
> Text: "Tel: +91-22-61801700"
> That is:
> Expected text: "<text before this>Tel: +91-22-61801700<text after this>"
> Actual text: "<text before this>Tel: +91-22-6180170  0<text after this>"
> The JS path of the element: body > div > div:nth-child(6) > div > 
> div.footer-full.footer-btm > div > p > span
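
The behaviour described above can be reproduced without Tika or crawler4j. SAX allows a parser to deliver one contiguous run of text through several characters() calls, so a handler that pads every call can put whitespace in the middle of a word. The sketch below is only an illustration under that assumption: ChunkedTextDemo, PaddingHandler, BufferingHandler and feed are hypothetical names, and the chunk boundary is chosen by hand to match the report. It is not the actual crawler4j or Tika code.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkedTextDemo {

    /** Shared base: collects everything the handler emits. */
    static abstract class Collector extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
    }

    /** Hypothetical handler that appends two spaces after every characters() call. */
    static class PaddingHandler extends Collector {
        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length).append("  ");
        }
    }

    /** Hypothetical alternative: append text as-is, pad only when the element closes. */
    static class BufferingHandler extends Collector {
        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("  ");
        }
    }

    /** Simulates a parser that delivers the text of one span in the given chunks. */
    static String feed(Collector handler, String... chunks) {
        try {
            handler.startElement("", "span", "span", new AttributesImpl());
            for (String chunk : chunks) {
                char[] buf = chunk.toCharArray();
                handler.characters(buf, 0, buf.length);
            }
            handler.endElement("", "span", "span");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        // Chunk boundary assumed to fall before the last digit, as in the report.
        String[] chunks = {"Tel: +91-22-6180170", "0"};
        System.out.println("[" + feed(new PaddingHandler(), chunks) + "]");
        System.out.println("[" + feed(new BufferingHandler(), chunks) + "]");
    }
}
```

With the padding handler the two spaces land between "6180170" and "0"; with the buffering handler the digits stay together and the padding follows the closing tag. Whether the fix belongs in the parser (fewer chunks) or in the handler (padding only at element boundaries) is left open here.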



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
