[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash closed TIKA-3170.
-----------------------
Fix Version/s: 1.25
Resolution: Duplicate
Duplicate of TIKA-3131
> PDF extraction space issue
> --------------------------
>
> Key: TIKA-3170
> URL: https://issues.apache.org/jira/browse/TIKA-3170
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.24.1
> Reporter: Akash
> Priority: Major
> Fix For: 1.25
>
> Attachments: document_example.pdf, image-2020-08-18-20-23-16-159.png
>
>
> While extracting text from PDF files, we see spaces inserted between some
> letters. According to the documentation below,
> [https://tika.apache.org/1.24.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html]
> we can resolve this by disabling the Enable Auto Space property. But when we
> disable it, other text loses spaces that it needs.
> With Enable Auto Space:
> > <p>*2014 C H A M B* R E 2 e S E S S I O N D E L A 5 4 e L É G I S L A T U R
> > EK A M E R 2 e Z I T T I N G V A N D E 5 4 e Z I T T I N G S P E R I O D E
> > 2015
> Without Enable Auto Space:
> > <p>*2014CHA*MBRE 2e SESSION DE LA 54e LÉGISLATUREKAMER 2e ZITTING VAN DE
> > 54e ZITTINGSPERIODE2015
>
> Now there is no space between 2014 and CHAMBRE.
>
> Is there a configuration that overcomes this issue?
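
When comparing the output of the two settings across many documents, it can
help to flag letter-spaced text automatically. The following is a minimal
sketch, not part of Tika: the class LetterSpacingCheck, its methods, and the
0.5 threshold are hypothetical names and values chosen for illustration. It
measures the fraction of whitespace-separated tokens that are a single letter
or digit, using shortened versions of the two samples quoted above. It only
detects the symptom; it cannot restore word boundaries that the extraction
lost.

```java
public class LetterSpacingCheck {

    /**
     * Returns the fraction of whitespace-separated tokens that consist of
     * exactly one letter or digit. Empty or blank input yields 0.0.
     */
    public static double singleCharTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int single = 0;
        for (String token : tokens) {
            // Count code points so that non-ASCII letters such as 'É' count as one character.
            boolean oneChar = token.codePointCount(0, token.length()) == 1;
            if (oneChar && Character.isLetterOrDigit(token.codePointAt(0))) {
                single++;
            }
        }
        return (double) single / tokens.length;
    }

    /** Hypothetical heuristic: text "looks letter-spaced" if the ratio reaches the threshold. */
    public static boolean looksLetterSpaced(String text, double threshold) {
        return singleCharTokenRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        // Shortened samples based on the two outputs quoted in the report.
        String withAutoSpace = "2014 C H A M B R E 2 e S E S S I O N";
        String withoutAutoSpace = "2014CHAMBRE 2e SESSION DE LA 54e";

        System.out.println(looksLetterSpaced(withAutoSpace, 0.5));
        System.out.println(looksLetterSpaced(withoutAutoSpace, 0.5));
    }
}
```

The threshold is an assumption and would need tuning on real documents, since
ordinary text also contains some single-character tokens.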