[jira] [Commented] (TIKA-3515) Korean chars not extracted correctly

Tim Allison (Jira) Mon, 09 Aug 2021 11:03:04 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396199#comment-17396199
 ]


Tim Allison commented on TIKA-3515:
-----------------------------------

I'm perplexed... :(  Yes, those question marks are literally byte 0x3F.

If you put the files in an input directory and run tika-app in batch mode, is 
the extracted text OK?

{noformat} java -jar tika-app.jar -J -t -i <input> -o <output> {noformat}
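
To tell whether an output file holds literal question marks (byte 0x3F) rather than U+FFFD replacement characters, a small byte scan can help. The following is only a sketch using the plain JDK: the class and method names are hypothetical and not part of Tika. The no-argument demo shows one generic way such bytes can arise (encoding Hangul to a charset that cannot represent it); it is an illustration, not a diagnosis of this issue.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for inspecting extracted text; not part of Tika.
class ReplacementCheck {

    // Counts literal '?' bytes (0x3F) in the raw data.
    static int countQuestionMarkBytes(byte[] data) {
        int n = 0;
        for (byte b : data) {
            if (b == 0x3F) {
                n++;
            }
        }
        return n;
    }

    // Counts U+FFFD characters after decoding the data as UTF-8.
    // Malformed input is also decoded to U+FFFD by new String(...).
    static int countReplacementChars(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\uFFFD') {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        byte[] data;
        if (args.length == 0) {
            // Demo: three Hangul syllables (written as escapes so the source
            // file encoding does not matter) encoded to ISO-8859-1, which
            // cannot represent them; each becomes a literal '?'.
            String korean = "\uD55C\uAD6D\uC5B4";
            data = korean.getBytes(StandardCharsets.ISO_8859_1);
        } else {
            // Assumption: args[0] is a UTF-8 output file written by tika-app.
            data = Files.readAllBytes(Paths.get(args[0]));
        }
        System.out.println("0x3F bytes: " + countQuestionMarkBytes(data));
        System.out.println("U+FFFD chars: " + countReplacementChars(data));
    }
}
```

Run it against a file from the output directory, for example `java ReplacementCheck <output>/<file>`. Note that a legitimate '?' in the source text is counted too, so compare the count against the original document.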

> Korean chars not extracted correctly
> ------------------------------------
>
>                 Key: TIKA-3515
>                 URL: https://issues.apache.org/jira/browse/TIKA-3515
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.27, 2.0.0-BETA
>         Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302
>            Reporter: Luís Filipe Nassif
>            Priority: Major
>         Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf, 
> LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt, 
> LIVE-Seoul-ntfs-utf-8.txt, LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml, 
> LIVE-Seoul-ntfs-utf-8_-t_output.txt, Screen Shot 2021-08-06 at 5.50.04 
> PM.png, Screen Shot 2021-08-06 at 5.50.21 PM.png, Screen Shot tika-app.png, 
> image-2021-08-09-14-37-30-552.png, image-2021-08-09-14-38-26-763.png
>
>
> Some Korean chars are extracted as squares. The encodings of the plain-text 
> files are detected correctly. Maybe this is related to the content handler 
> (just a guess). I'll attach the triggering files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
