[jira] [Commented] (TIKA-2650) Soft-hyphen is not extracted properly

Yauheni Salopiy (Jira) Wed, 19 Feb 2020 05:15:10 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040011#comment-17040011
 ]


Yauheni Salopiy commented on TIKA-2650:
---------------------------------------

Hi [~tilman],

Do you mean that setting 
[https://tika.apache.org/1.23/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setSortByPosition(boolean)]
 to true might help with correct text extraction in this particular case?


According to the description below, it might introduce other issues. Am I right?

_If true, sort text tokens by their x/y position before extracting text. This 
may be necessary for some PDFs (if the text tokens are not rendered "in 
order"), while for other PDFs it can produce the wrong result (for example if 
there are 2 columns, the text will be interleaved). Default is false._
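
To make the trade-off in that description concrete, here is a minimal sketch. It is not the actual Tika/PDFBox algorithm. The Token class, the coordinates and the sample pages are all hypothetical, and y is simplified to grow downward. It only shows why sorting by x/y repairs out-of-order tokens in one column but interleaves two columns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class SortByPositionDemo {

    /** Hypothetical text token: a word plus the page position where it is drawn. */
    static final class Token {
        final String text;
        final int x;
        final int y; // simplified: larger y means further down the page

        Token(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins tokens in the order they were written (analogous to the default, false). */
    static String inStreamOrder(List<Token> tokens) {
        return tokens.stream().map(t -> t.text).collect(Collectors.joining(" "));
    }

    /** Sorts a copy top-to-bottom, then left-to-right, and joins the result. */
    static String sortedByPosition(List<Token> tokens) {
        List<Token> copy = new ArrayList<>(tokens);
        copy.sort(Comparator.comparingInt((Token t) -> t.y).thenComparingInt(t -> t.x));
        return inStreamOrder(copy);
    }

    /** Two columns; the content stream writes the whole left column first. */
    static List<Token> twoColumnPage() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("Left1", 0, 0));
        tokens.add(new Token("Left2", 0, 10));
        tokens.add(new Token("Right1", 100, 0));
        tokens.add(new Token("Right2", 100, 10));
        return tokens;
    }

    /** One column whose tokens were written out of reading order. */
    static List<Token> shuffledSingleColumn() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("world", 50, 0));
        tokens.add(new Token("Hello", 0, 0));
        tokens.add(new Token("again", 0, 10));
        return tokens;
    }

    public static void main(String[] args) {
        // Sorting helps when tokens are not rendered "in order" ...
        System.out.println(inStreamOrder(shuffledSingleColumn()));
        System.out.println(sortedByPosition(shuffledSingleColumn()));
        // ... but mixes the columns of a two-column layout line by line.
        System.out.println(inStreamOrder(twoColumnPage()));
        System.out.println(sortedByPosition(twoColumnPage()));
    }
}
```

In Tika itself the flag would normally be applied by calling setSortByPosition(true) on a PDFParserConfig and registering that config in the ParseContext passed to the parser. Whether it helps with this particular document is exactly the open question.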

Best Regards,
Yauheni Salopiy

> Soft-hyphen is not extracted properly
> -------------------------------------
>
>                 Key: TIKA-2650
>                 URL: https://issues.apache.org/jira/browse/TIKA-2650
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Saurabh Patil
>            Priority: Blocker
>         Attachments: Peter Rabbit.pdf, document_example.pdf, 
> document_example.txt, output.txt
>
>
> We are trying to extract text from a PDF. If the PDF has a long word at the 
> end of a line, the word is split by a soft hyphen and the rest of the word 
> goes to the next line. When extracting this text, Tika automatically replaces 
> the hyphen with a space.  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
