[jira] [Commented] (TIKA-2650) Soft-hyphen is not extracted properly

Yauheni Salopiy (Jira) Wed, 19 Feb 2020 09:05:30 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040240#comment-17040240
 ]


Yauheni Salopiy commented on TIKA-2650:
---------------------------------------

Hi [~tilman],

I tried the option you suggested (org.apache.tika.parser.pdf.PDFParser, 
sortByPosition=true), and this made the results even worse :)

Now text from different columns is interleaved, and the issue with hyphens is 
still there.

Please see for reference:
 * [^document_example_w_sort.txt] - extracted text with sortByPosition=true
 * [^document_example_wo_sort.txt] - extracted text with sortByPosition=false

Best Regards,
Yauheni Salopiy

> Soft-hyphen is not extracted properly
> -------------------------------------
>
>                 Key: TIKA-2650
>                 URL: https://issues.apache.org/jira/browse/TIKA-2650
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Saurabh Patil
>            Priority: Blocker
>         Attachments: Peter Rabbit.pdf, document_example.pdf, 
> document_example.txt, document_example_w_sort.txt, 
> document_example_wo_sort.txt, image-2020-02-19-12-03-19-968.png, output.txt
>
>
> We are trying to extract text from a PDF. If the PDF has a long word at the 
> end of a line, a soft hyphen follows the first half of the word and the rest 
> of the word goes to the next line. But while extracting this text, Tika 
> automatically replaces the hyphen with a space.  
>  
>  
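
For illustration only, here is a minimal sketch of the joining behaviour the 
description above seems to expect. It assumes the soft hyphen (U+00AD) is still 
present in the extracted text, which is not what Tika currently returns 
according to this report. The class and method names (SoftHyphenJoiner, 
joinSoftHyphenBreaks) are hypothetical and are not part of Tika or PDFBox.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: shows how a word split by a
// soft hyphen at a line end could be joined back into one word.
public class SoftHyphenJoiner {

    // U+00AD (soft hyphen), optional spaces/tabs, a line break,
    // then optional leading spaces/tabs on the next line.
    private static final Pattern SOFT_HYPHEN_BREAK =
            Pattern.compile("\u00AD[ \\t]*\\R[ \\t]*");

    static String joinSoftHyphenBreaks(String text) {
        // Join the two halves of a word split across lines.
        String joined = SOFT_HYPHEN_BREAK.matcher(text).replaceAll("");
        // Drop any soft hyphens that remain inside a line.
        return joined.replace("\u00AD", "");
    }

    public static void main(String[] args) {
        // Assumed input: the soft hyphen survived extraction.
        String extracted = "text extrac\u00AD\ntion from PDF";
        System.out.println(joinSoftHyphenBreaks(extracted));
    }
}
```

This would not help with the output attached to this issue, where the hyphen 
has already been replaced by a space; it only shows the expected end result.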



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
