[ https://issues.apache.org/jira/browse/TIKA-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040308#comment-17040308 ]
Tilman Hausherr commented on TIKA-2650:
---------------------------------------
I wrote that it depends. There is no perfect solution. In your example, the
unsorted output is better.
Re soft hyphens: how would one tell a soft hyphen from a real one? Here both
are just "-". They don't have a different char code. Sometimes the "-" belongs
to a word that has a hyphen in the middle. For example,
"anti-misbruikbepaling" has the same "-" as "voorafbetalin-
gen". A solution would have to be dictionary-based.
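To make the dictionary-based idea concrete, here is a minimal sketch of a
post-processing step that could run on already extracted text. It is not what
Tika or PDFBox does. The function name dehyphenate and the tiny word list are
hypothetical, and a real word list would have to come from somewhere else. The
rule is conservative: the "-" is dropped only when the joined word is in the
dictionary, otherwise it is kept.

```python
def dehyphenate(lines, dictionary):
    """Join words split at a line-end "-", using a word list to decide
    whether the "-" was only a line-break hyphen (hypothetical helper)."""
    out = []
    for raw in lines:
        line = raw.strip()
        if out and out[-1].endswith("-") and line:
            # Fragment before the "-" on the previous line.
            prev_body, _, fragment = out[-1][:-1].rpartition(" ")
            # First token of the current line, minus trailing punctuation.
            head, _, rest = line.partition(" ")
            core = head.rstrip(".,;:!?")
            punct = head[len(core):]
            joined = fragment + core
            if joined.lower() in dictionary:
                word = joined                  # hyphen was a line-break hyphen
            else:
                word = fragment + "-" + core   # unsure: keep the hyphen
            out[-1] = (prev_body + " " if prev_body else "") + word + punct
            if rest:
                out.append(rest)
        else:
            out.append(line)
    return out


# Assumed word list, for illustration only.
words = {"voorafbetalingen", "anti-misbruikbepaling"}
text = ["de voorafbetalin-", "gen en de anti-", "misbruikbepaling."]
result = dehyphenate(text, words)
```

Note that the rejoined word moves up to the earlier line, so the line layout
of the output differs from the input. Words missing from the dictionary keep
their "-", which can still be wrong for compounds the list does not know.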
> Soft-hyphen is not extracted properly
> -------------------------------------
>
> Key: TIKA-2650
> URL: https://issues.apache.org/jira/browse/TIKA-2650
> Project: Tika
> Issue Type: Bug
> Components: app
> Affects Versions: 1.18
> Reporter: Saurabh Patil
> Priority: Blocker
> Attachments: Peter Rabbit.pdf, document_example.pdf,
> document_example.txt, document_example_w_sort.txt,
> document_example_wo_sort.txt, output.txt
>
>
> We are trying to extract text from a PDF. If a long word falls at the end of
> a line, it is split with a soft hyphen and the rest of the word goes to the
> next line. But when extracting this text, Tika automatically replaces the
> hyphen with a space.
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)