[jira] [Commented] (TIKA-2650) Soft-hyphen is not extracted properly

Tim Allison (JIRA) Thu, 24 May 2018 12:05:25 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489606#comment-16489606
 ]


Tim Allison commented on TIKA-2650:
-----------------------------------

Can you share with us exactly where the soft hyphen isn't being extracted 
correctly?  I see it working in some places.  Note that there is often a 
difference between the text as displayed and the text that is electronically 
stored (OCR'd?) within the PDF.
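
One way to pin down exactly where the behavior differs is to scan the extracted text for hyphen-like code points and report their offsets. The following is only a minimal sketch under stated assumptions: it assumes the extracted text is already available as a Python string, and the helper name `find_hyphen_like` and the sample string are hypothetical, not part of Tika.

```python
import unicodedata

# Code points that can look like a hyphen at a line break.
HYPHEN_LIKE = {
    "\u002d",  # HYPHEN-MINUS
    "\u00ad",  # SOFT HYPHEN
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
}


def find_hyphen_like(text, context=10):
    """Return (offset, code point, Unicode name, nearby text) for each hyphen-like character."""
    hits = []
    for i, ch in enumerate(text):
        if ch in HYPHEN_LIKE:
            snippet = text[max(0, i - context): i + context + 1]
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch), snippet))
    return hits


if __name__ == "__main__":
    # Hypothetical sample; replace with text extracted from the PDF in question.
    sample = "a long adven\u00adture and a well-known rabbit"
    for hit in find_hyphen_like(sample):
        print(hit)
```

Running this on the extracted text shows whether a given line break contains U+00AD, an ordinary hyphen-minus, or no hyphen character at all, which can then be compared with what the PDF displays.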

> Soft-hyphen is not extracted properly
> -------------------------------------
>
>                 Key: TIKA-2650
>                 URL: https://issues.apache.org/jira/browse/TIKA-2650
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Saurabh Patil
>            Priority: Blocker
>         Attachments: Peter Rabbit.pdf
>
>
> We are trying to extract text from a PDF. When a long word falls at the end 
> of a line, it is split with a soft hyphen and the rest of the word continues 
> on the next line. When extracting this text, Tika automatically replaces the 
> hyphen with a space.  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
