[ https://issues.apache.org/jira/browse/TIKA-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Tyson updated TIKA-3307:
-----------------------------
    Description: 
Extracted text from some PDF files includes some strings with repeated (doubled) characters.

To reproduce the problem, download the attached PDF file and run the following command:
{code:java}
java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
{code}
The bad strings all seem to be headings, so perhaps something in the font or other style features is causing the problem.

First detected in version 1.19, retested with 1.25. Did not test earlier versions.

  was:
Extracted text from some PDF files includes some strings with repeated (doubled) characters.

To reproduce the problem, download the attached PDF file and run the following command:
{code:java}
java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
{code}
The bad strings all seem to be headings, so perhaps something in the font or other style features is causing the problem.


> extracted text strings have repeated characters
> -----------------------------------------------
>
>                 Key: TIKA-3307
>                 URL: https://issues.apache.org/jira/browse/TIKA-3307
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Paul Tyson
>            Priority: Major
>         Attachments: WSHP-PRC025F-EN_07132019.pdf
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
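
For anyone reproducing this without egrep, the filter in the command above can be approximated in Java. This is only a sketch of the check, not part of Tika: the class name DoubledCharFilter and the method hasDoubledPairs are made up for illustration, and the program simply reads already-extracted text from standard input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not a Tika class: mirrors the egrep filter from the report.
class DoubledCharFilter {
    // Same expression as the egrep filter: any character followed by itself,
    // immediately followed by another character followed by itself.
    private static final Pattern DOUBLED_PAIRS = Pattern.compile("(.)\\1(.)\\2");

    // Returns true if the line contains two consecutive doubled characters.
    static boolean hasDoubledPairs(String line) {
        return DOUBLED_PAIRS.matcher(line).find();
    }

    // Reads text from standard input and prints only the suspicious lines.
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (hasDoubledPairs(line)) {
                    System.out.println(line);
                }
            }
        }
    }
}
```

It could be used as `java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | java DoubledCharFilter.java` on Java 11 or later. Like the egrep version, it is a rough filter: the dot also matches spaces, so a line containing a run of four spaces is reported even if its text is fine.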