[jira] [Commented] (TIKA-2650) Soft-hyphen is not extracted properly

Saurabh Patil (JIRA) Fri, 25 May 2018 02:53:23 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490491#comment-16490491
 ]


Saurabh Patil commented on TIKA-2650:
-------------------------------------

Hey Tim,

I have attached a PDF for the demo. I extracted the text from this PDF and 
stored it in output.txt, which I have also attached here. On page 25 there is 
the word "cucumber". If you check the PDF, "cu-" is at the end of one line and 
"cumber" goes to the next line. In the same way, the words "cabbages" and 
"calling" are also broken across lines on the same page.

When checking the output, the following changes appear:

cu-cumber  =>  cu cumber

cab-bages  =>  cab bages

call-ing   =>  call ing

 

[^output.txt]

Is there any way to get these word parts concatenated back into a single 
word?
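
As an illustration of the joining being asked for, a minimal Java sketch 
follows. It is not part of Tika: the class name Dehyphenate and the method 
joinHyphenatedWords are hypothetical, and it assumes the hyphen (or the soft 
hyphen, U+00AD) and the line break are still present in the text, which the 
examples above suggest is not the case in the current output.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Tika API: joins words split across lines.
class Dehyphenate {
    // Matches a hyphen-minus or soft hyphen (U+00AD) that follows a letter,
    // sits at the end of a line, and is followed on the next line by a
    // lowercase letter. Assumption: such a hyphen marks a line-break split.
    private static final Pattern LINE_END_HYPHEN =
            Pattern.compile("(?<=\\p{L})[-\\u00AD][ \\t]*\\R[ \\t]*(?=\\p{Ll})");

    static String joinHyphenatedWords(String text) {
        // Drop the hyphen and the line break so the two halves become one word.
        return LINE_END_HYPHEN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "a cu-\ncumber and cab\u00AD\nbages";
        System.out.println(joinHyphenatedWords(raw));
    }
}
```

A rule like this would also merge genuinely hyphenated compounds that happen 
to break at the hyphen (for example "well-" / "known"), so it is only a rough 
approximation of the desired behaviour.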

 

> Soft-hyphen is not extracted properly
> -------------------------------------
>
>                 Key: TIKA-2650
>                 URL: https://issues.apache.org/jira/browse/TIKA-2650
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Saurabh Patil
>            Priority: Blocker
>         Attachments: Peter Rabbit.pdf, output.txt
>
>
> We are trying to extract text from a PDF. If the PDF has a long word at the 
> end of a line, a soft hyphen is inserted partway through the word and the 
> rest of the word goes to the next line. But when extracting this text, Tika 
> automatically replaces the hyphen with a space.  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
