[
https://issues.apache.org/jira/browse/TIKA-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vivek updated TIKA-3024:
-------------------------
Description:
Website: [http://www.thevanitycase.com/about-us.php]
When the page is parsed with the Tika parser, the text of one element is split
into two chunks, and each chunk is sent separately to crawler4j for content
handling, even though the text is contained within a single tag (a span tag).
The content handler appends whitespace (" ") after each chunk, as it normally
does for any text it receives, so an extra space ends up inside the text.
Affected text: "Tel: +91-22-61801700".
That is,
Expected text: "<text before this>Tel: +91-22-61801700<text after this>"
Actual text: "<text before this>Tel: +91-22-6180170 0<text after this>"
The JS path of the element: body > div > div:nth-child(6) > div >
div.footer-full.footer-btm > div > p > span
Usually, double whitespace is appended between the texts of separate tag
elements. Here, however, it is appended within the text of a single tag element,
because the parser detects that text as the content of two different HTML tags.
was:
Website: [http://www.thevanitycase.com/about-us.php]
While parsing the content of the page using Tika Parser, extra whitespace ("
") is appended in the text "Tel: +91-22-61801700". That is,
Expected text: "<text before this>Tel: +91-22-61801700<text after this>"
Actual text: "<text before this>Tel: +91-22-6180170 0<text after this>"
The JS path of the element: body > div > div:nth-child(6) > div >
div.footer-full.footer-btm > div > p > span
Usually, double whitespace will be appended between every tag element text. But
here double whitespace is appended within a tag element text.
> Extra whitespace appended within a tag element's text
> -----------------------------------------------------
>
> Key: TIKA-3024
> URL: https://issues.apache.org/jira/browse/TIKA-3024
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16, 1.20
> Reporter: Vivek
> Priority: Major
>
> Website: [http://www.thevanitycase.com/about-us.php]
> When the page is parsed with the Tika parser, the text of one element is
> split into two chunks, and each chunk is sent separately to crawler4j for
> content handling, even though the text is contained within a single tag (a
> span tag). The content handler appends whitespace (" ") after each chunk, as
> it normally does for any text it receives, so an extra space ends up inside
> the text.
> Affected text: "Tel: +91-22-61801700".
> That is,
> Expected text: "<text before this>Tel: +91-22-61801700<text after this>"
> Actual text: "<text before this>Tel: +91-22-6180170 0<text after this>"
> The JS path of the element: body > div > div:nth-child(6) > div >
> div.footer-full.footer-btm > div > p > span
>
> Usually, double whitespace is appended between the texts of separate tag
> elements. Here, however, it is appended within the text of a single tag
> element, because the parser detects that text as the content of two different
> HTML tags.
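
For illustration, here is a minimal sketch of how the reported symptom could arise. The SAX ContentHandler contract allows a parser to deliver the contiguous text of one element in several characters() calls, so a handler that appends a space after every call will put a space inside the text whenever a split happens. The class and method names below (SplitTextDemo, NaiveHandler, BufferingHandler, naive, buffered) are hypothetical, the split point is simulated by hand, and this is not the actual Tika or crawler4j code; it only assumes that the downstream handler appends a separator per characters() call.

```java
import org.xml.sax.helpers.DefaultHandler;

public class SplitTextDemo {

    /** Assumed downstream behaviour: append a space after every characters() call. */
    static class NaiveHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length).append(' ');
        }
    }

    /** Possible workaround: buffer chunks and append the separator only when the element ends. */
    static class BufferingHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private final StringBuilder pending = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            pending.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (pending.length() > 0) {
                out.append(pending).append(' ');
                pending.setLength(0);
            }
        }
    }

    /** Feeds the chunks of a single element's text to the naive handler. */
    static String naive(String... chunks) {
        NaiveHandler h = new NaiveHandler();
        for (String c : chunks) {
            char[] buf = c.toCharArray();
            h.characters(buf, 0, buf.length);
        }
        return h.out.toString();
    }

    /** Feeds the same chunks to the buffering handler, then closes the element. */
    static String buffered(String... chunks) {
        BufferingHandler h = new BufferingHandler();
        for (String c : chunks) {
            char[] buf = c.toCharArray();
            h.characters(buf, 0, buf.length);
        }
        h.endElement("", "span", "span");
        return h.out.toString();
    }

    public static void main(String[] args) {
        // Simulate the parser splitting the span text before the final "0".
        System.out.println("[" + naive("Tel: +91-22-6180170", "0") + "]");    // [Tel: +91-22-6180170 0 ]
        System.out.println("[" + buffered("Tel: +91-22-6180170", "0") + "]"); // [Tel: +91-22-61801700 ]
    }
}
```

If this is indeed the cause, the split itself would be legal parser behaviour, and the separator logic would need to key off element boundaries rather than individual characters() calls.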