[ 
https://issues.apache.org/jira/browse/TIKA-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490491#comment-16490491
 ] 

Saurabh Patil commented on TIKA-2650:
-------------------------------------

Hey Tim,

I have attached one PDF for the demo. I extracted the text from this PDF and 
stored it in output.txt, which is also attached here. On page 25 there is the 
word "cucumber". If you check the PDF, "cu-" sits at the end of one line and 
"cumber" continues on the next line. In the same way, the words "cabbages" and 
"calling" are also broken across a line break on the same page.

When I check the output, these words come out changed as follows:

cu­-cumber  =>  cu­ cumber

cab­-bages  => cab­ bages

call­-ing   =>  call­ ing      

 

[^output.txt]

Is there any way to get the two parts concatenated back into a single 
word?
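
For illustration, here is a rough sketch of the kind of post-processing I have in mind. It is only a workaround applied to the already extracted string and not a Tika feature: it assumes the soft hyphen (U+00AD) survives in the output, followed by a space, a line break, or a visible hyphen, as in the examples above. The class name SoftHyphenJoiner and the method join are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: rejoins words split at a soft hyphen.
class SoftHyphenJoiner {
    // A soft hyphen (U+00AD), an optional visible hyphen, then any whitespace
    // (spaces or line breaks) left behind by the extraction.
    private static final Pattern SOFT_BREAK = Pattern.compile("\\u00AD-?\\s*");

    // Removes every soft-hyphen break so the two word halves become one word.
    static String join(String text) {
        return SOFT_BREAK.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input mimicking the extracted output described above.
        String extracted = "cu\u00AD cumber, cab\u00AD bages, call\u00AD-ing";
        System.out.println(join(extracted));
    }
}
```

This also strips soft hyphens that occur in the middle of a line, which is usually harmless because they are invisible there anyway. It cannot recover words where the soft hyphen itself was dropped during extraction.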

 

> Soft-hyphen is not extracted properly
> -------------------------------------
>
>                 Key: TIKA-2650
>                 URL: https://issues.apache.org/jira/browse/TIKA-2650
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Saurabh Patil
>            Priority: Blocker
>         Attachments: Peter Rabbit.pdf, output.txt
>
>
> We are trying to extract text from a PDF. If the PDF has a long word at the 
> end of a line, the word is split with a soft hyphen and the rest of the word 
> goes to the next line. But while extracting this text, Tika automatically 
> replaces the hyphen with a space.  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
