[ 
https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394996#comment-17394996
 ] 

Tim Allison edited comment on TIKA-3515 at 8/6/21, 9:55 PM:
------------------------------------------------------------

I'm seeing most of the text coming through for the PDF, at least with PDFBox 
and with tika-app.jar -t > output.txt. Is the encoding getting corrupted in 
your shell when you redirect? Is your shell set to use UTF-8?

And I see the same thing in tika-app.  There are definitely two short runs of 
??/boxes at the beginning of the file, but I see those in PDFBox's ExtractText 
output as well.

Hmmmm....

bq. Tika cli -t and -A don't work, but -x and -h work

How are you viewing the output?  Can you attach the .txt file that's broken 
with the -t option?
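
One way to rule the shell in or out is to look at the raw bytes of the 
redirected file instead of how a terminal or editor renders them. Below is a 
rough Python sketch, not part of Tika or PDFBox; the function name 
summarize_text_bytes is made up for illustration. It reports whether the bytes 
are valid UTF-8 and counts Hangul syllables, literal '?' characters and U+FFFD 
replacement characters.

```python
import sys


def summarize_text_bytes(data: bytes) -> dict:
    """Summarize how a byte string looks when interpreted as UTF-8.

    Hypothetical helper for illustration only; it is not a Tika API.
    """
    try:
        text = data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        # Fall back to a lossy decode so the counts below still work.
        text = data.decode("utf-8", errors="replace")
        valid = False

    # Precomposed Hangul syllables occupy U+AC00 through U+D7A3.
    hangul = sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")

    return {
        "valid_utf8": valid,
        "hangul_syllables": hangul,
        "replacement_chars": text.count("\ufffd"),
        "question_marks": text.count("?"),
    }


if __name__ == "__main__":
    # Usage: python check_output.py output.txt
    with open(sys.argv[1], "rb") as fh:
        print(summarize_text_bytes(fh.read()))
```

If valid_utf8 comes back False, or the Hangul count is zero while the '?' count 
is high, the text was probably mangled on its way into the file (for example 
by a non-UTF-8 default charset) rather than by the parser. That is only a 
heuristic, so the attached .txt file would still be useful.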


was (Author: [email protected]):
I'm seeing most of the text coming through for the PDF at least with pdfbox and 
with tika-app.jar -t > output.txt. Is the encoding getting corrupted in your 
shell when you redirect?

And I see the same thing in tika-app.  There are definitely two short runs of 
??/boxes at the beginning of the file, but I see those in PDFBox's Extract text.

Hmmmm....

bq. Tika cli -t and -A don't work, but -x and -h work

How are you viewing the output?  Can you attach the .txt file that's broken 
with the -t option?

> Korean chars not extracted correctly
> ------------------------------------
>
>                 Key: TIKA-3515
>                 URL: https://issues.apache.org/jira/browse/TIKA-3515
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.27, 2.0.0-BETA
>            Reporter: Luís Filipe Nassif
>            Priority: Major
>         Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf, 
> LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt, 
> LIVE-Seoul-ntfs-utf-8.txt, Screen Shot 2021-08-06 at 5.50.04 PM.png, Screen 
> Shot 2021-08-06 at 5.50.21 PM.png, Screen Shot tika-app.png
>
>
> Some Korean chars are extracted as squares. The encodings of plain texts are 
> detected correctly. Maybe this is related with the content handler (just a 
> guess). I'll attach the triggering files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
