Akash created TIKA-3170:
---------------------------
Summary: PDF extraction space issue
Key: TIKA-3170
URL: https://issues.apache.org/jira/browse/TIKA-3170
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.24.1
Reporter: Akash
Attachments: document_example.pdf
While extracting text from PDF files, we see spaces inserted between
individual letters in some passages. According to the documentation below,
[https://tika.apache.org/1.24.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html]
this can be resolved by disabling the Enable Auto Space property. But when we
disable it, a different problem appears: spaces that should separate words
go missing in other parts of the text.
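
For reference, a minimal sketch of how the property can be switched off in a
Tika config file. It assumes the Tika 1.x tika-config.xml layout and that
PDFParser accepts an enableAutoSpace parameter mirroring
PDFParserConfig.setEnableAutoSpace; neither has been re-verified against
1.24.1 here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers, but take PDFParser out so it can be
         configured separately below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Assumed parameter name: enableAutoSpace (default is true).
         false stops the parser from inserting spaces between characters
         based on their positions. -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="enableAutoSpace" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
```

The same setting is exposed in code as PDFParserConfig.setEnableAutoSpace(false),
with the config object passed to the parser through the ParseContext.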
With Enable Auto Space enabled:
< <p>*2014 C H A M B* R E 2 e S E S S I O N D E L A 5 4 e L É G I S L A T U R
EK A M E R 2 e Z I T T I N G V A N D E 5 4 e Z I T T I N G S P E R I O D E 2015
With Enable Auto Space disabled:
> <p>*2014CHA*MBRE 2e SESSION DE LA 54e LÉGISLATUREKAMER 2e ZITTING VAN DE 54e
> ZITTINGSPERIODE2015
Now there is no space between 2014 and CHAMBRE (nor between LÉGISLATURE and
KAMER, or between ZITTINGSPERIODE and 2015).
Is there a configuration that overcomes this issue?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)