[ 
https://issues.apache.org/jira/browse/TIKA-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Tyson updated TIKA-3307:
-----------------------------
    Description: 
Extracted text from some PDF files includes some strings with repeated 
(doubled) characters.

To reproduce the problem, download the attached PDF file and run the following 
command:
{code:java}
java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
{code}
The bad strings all seem to be headings, so perhaps something in the font or 
other style features is causing the problem.

First detected in version 1.19, retested with 1.25. Did not test earlier 
versions.

  was:
Extracted text from some PDF files includes some strings with repeated 
(doubled) characters.

To reproduce the problem, download the attached PDF file and run the following 
command:
{code:java}
java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
{code}
The bad strings all seem to be headings, so perhaps something in the font or 
other style features is causing the problem.


> extracted text strings have repeated characters
> -----------------------------------------------
>
>                 Key: TIKA-3307
>                 URL: https://issues.apache.org/jira/browse/TIKA-3307
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Paul Tyson
>            Priority: Major
>         Attachments: WSHP-PRC025F-EN_07132019.pdf
>
>
> Extracted text from some PDF files includes some strings with repeated 
> (doubled) characters.
> To reproduce the problem, download the attached PDF file and run the following 
> command:
> {code:java}
> java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
> {code}
> The bad strings all seem to be headings, so perhaps something in the font or 
> other style features is causing the problem.
> First detected in version 1.19, retested with 1.25. Did not test earlier 
> versions.
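
For anyone who wants to run the same check on extracted text from Java rather than through egrep, here is a minimal sketch using the same pattern. It uses only java.util.regex and does not call Tika. The class name DoubledCharCheck and the method suspectLines are hypothetical names chosen for illustration, and the sample strings are made up, not taken from the attached PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: flags lines of already-extracted text
// that match the pattern from the egrep command in the report.
class DoubledCharCheck {

    // One character immediately repeated, followed by another character immediately repeated.
    private static final Pattern DOUBLED_PAIRS = Pattern.compile("(.)\\1(.)\\2");

    // Returns every line of the given text that contains at least one match.
    static List<String> suspectLines(String extractedText) {
        List<String> hits = new ArrayList<>();
        for (String line : extractedText.split("\\R")) {
            if (DOUBLED_PAIRS.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample input: a doubled heading, a normal sentence, and a
        // legitimate word that happens to contain consecutive double letters.
        String sample = "IInnttrroodduuccttiioonn\nThis line is fine.\nbookkeeper";
        for (String line : suspectLines(sample)) {
            System.out.println(line);
        }
    }
}
```

The pattern is only a rough filter. It also matches legitimate words with consecutive double letters (such as "bookkeeper") and any run of four identical characters (such as "----"), so the hits still need to be reviewed by eye.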



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
