[jira] [Commented] (TIKA-3170) PDF extraction space issue

Akash (Jira) Tue, 18 Aug 2020 07:34:22 -0700


    [ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179656#comment-17179656 ]


Akash commented on TIKA-3170:
-----------------------------

One more observation: the extracted output stays the same from tika app 1.9
through tika app 1.24.

The difference first appears in tika app 1.24.1.

Did anything specific to PDF parsing change in version 1.24.1?

> PDF extraction space issue
> --------------------------
>
>                 Key: TIKA-3170
>                 URL: https://issues.apache.org/jira/browse/TIKA-3170
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.24.1
>            Reporter: Akash
>            Priority: Major
>         Attachments: document_example.pdf
>
>
> While extracting text from PDF files, we see spaces inserted between some
> letters. According to the documentation at
> [https://tika.apache.org/1.24.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html]
> we can resolve this by disabling the Enable Auto Space property. But when we 
> disable it, a different problem appears in other text.
> With Enable Auto Space:
> > <p>*2014 C H A M B* R E 2 e S E S S I O N D E L A 5 4 e L É G I S L A T U R 
> EK A M E R 2 e Z I T T I N G V A N D E 5 4 e Z I T T I N G S P E R I O D E 
> 2015
> Without Enable Auto Space:
> > <p>*2014CHA*MBRE 2e SESSION DE LA 54e LÉGISLATUREKAMER 2e ZITTING VAN DE 
> > 54e ZITTINGSPERIODE2015
>  
> Now there is no space between 2014 and CHAMBRE.
>  
> Is there a configuration option that overcomes this issue?
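
For reference, the Enable Auto Space toggle described above can be exercised programmatically to reproduce both outputs side by side. The following is a minimal sketch, assuming Tika 1.24.x (tika-parsers) is on the classpath; the class name AutoSpaceDemo and the helpers buildConfig and extract are hypothetical names chosen for illustration, and the sketch extracts plain text rather than the XHTML shown in the samples.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical demo class, not part of Tika.
public class AutoSpaceDemo {

    // Build a PDF parser configuration with the auto-space heuristic on or off.
    static PDFParserConfig buildConfig(boolean enableAutoSpace) {
        PDFParserConfig config = new PDFParserConfig();
        config.setEnableAutoSpace(enableAutoSpace);
        return config;
    }

    // Extract plain text from a PDF using the given auto-space setting.
    static String extract(Path pdf, boolean enableAutoSpace) throws Exception {
        ParseContext context = new ParseContext();
        // The PDF parser picks up a PDFParserConfig placed in the ParseContext.
        context.set(PDFParserConfig.class, buildConfig(enableAutoSpace));

        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        try (InputStream in = Files.newInputStream(pdf)) {
            new AutoDetectParser().parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the attachment name from the issue; pass another path as the first argument.
        Path pdf = Paths.get(args.length > 0 ? args[0] : "document_example.pdf");
        System.out.println("--- enableAutoSpace = true ---");
        System.out.println(extract(pdf, true));
        System.out.println("--- enableAutoSpace = false ---");
        System.out.println(extract(pdf, false));
    }
}
```

Running the same class against different Tika versions on the classpath is one way to check the observation that the output changes starting with 1.24.1.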



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
