[
https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394902#comment-17394902
]
Luís Filipe Nassif edited comment on TIKA-3515 at 8/6/21, 5:15 PM:
-------------------------------------------------------------------
Some additional information:
# Happens with the Tika CLI (redirecting -t output to a file) and with the GUI,
for all attached files (txt & pdf)
# Pure PDFBox ExtractText works fine with the PDF file
# If I copy and paste 서울 in a JTextField it is shown as 2 squares...
# Tika cli -t and -A don't work, but -x and -h work
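A minimal sketch for narrowing down the JTextField observation above: dumping the code points of a string shows whether the text still holds the Hangul syllables of 서울 (U+C11C U+C6B8) and is only being rendered as squares, or whether the characters were already replaced before display. The class and method names are hypothetical and not part of Tika or PDFBox; only JDK calls are used.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of Tika or PDFBox.
class CodePointDump {

    // Render each code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringJoiner out = new StringJoiner(" ");
        text.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out.toString();
    }

    // True if the text contains U+FFFD, the Unicode replacement character,
    // which usually indicates that a decoding step lost the original bytes.
    static boolean hasReplacementChar(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // Default sample is the two Hangul syllables from this issue,
        // written as escapes so the source file stays ASCII-only.
        String sample = args.length > 0 ? args[0] : "\uC11C\uC6B8";
        System.out.println(describe(sample));
        System.out.println("contains U+FFFD: " + hasReplacementChar(sample));
    }
}
```

If the dump of the extracted text shows U+C11C U+C6B8, the string itself is intact and the squares would point at font or output-encoding handling rather than at extraction; that reading is an assumption to verify, not a confirmed diagnosis.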
> Korean chars not extracted correctly
> ------------------------------------
>
> Key: TIKA-3515
> URL: https://issues.apache.org/jira/browse/TIKA-3515
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.27, 2.0.0-BETA
> Reporter: Luís Filipe Nassif
> Priority: Major
> Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf,
> LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt,
> LIVE-Seoul-ntfs-utf-8.txt
>
>
> Some Korean chars are extracted as squares. The encodings of the plain-text
> files are detected correctly. Maybe this is related to the content handler
> (just a guess). I'll attach the triggering files.