[ 
https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397391#comment-17397391
 ] 

Tim Allison commented on TIKA-3515:
-----------------------------------

If we're going to make this change in tika-app, I think we should also 
deprecate the constructors of WriteOutContentHandler and ToTextContentHandler 
that take only an OutputStream, because these call Charset.defaultCharset().
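
To illustrate the concern, here is a minimal JDK-only sketch (not Tika code) 
of why wrapping an OutputStream with the platform default charset is fragile. 
The class name CharsetSketch and the helper encode() are hypothetical. On a 
platform whose default charset cannot represent Korean, the first round trip 
may fail, while the explicit UTF-8 path does not depend on the platform.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only uses JDK APIs and does not touch Tika.
class CharsetSketch {

    // Writes the text through an OutputStreamWriter that uses the given charset
    // and returns the bytes that reached the underlying stream.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String korean = "\uC11C\uC6B8"; // "Seoul" written in Hangul

        // Platform-dependent: the result varies with the JVM's default charset.
        Charset platform = Charset.defaultCharset();
        byte[] viaDefault = encode(korean, platform);
        System.out.println("default charset: " + platform);
        System.out.println("round-trips via default: "
                + new String(viaDefault, platform).equals(korean));

        // Explicit charset: the same bytes on every platform.
        byte[] viaUtf8 = encode(korean, StandardCharsets.UTF_8);
        System.out.println("round-trips via UTF-8: "
                + new String(viaUtf8, StandardCharsets.UTF_8).equals(korean));
    }
}
```

Assuming the handlers' Writer-based constructors are used, a caller could 
pass new OutputStreamWriter(stream, StandardCharsets.UTF_8) rather than the 
bare stream, which keeps the encoding choice explicit.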

We can also clean up uses of the default charset in some of our unit tests.  
I'm concerned about what might happen if we try to change them in the 
translators, so I'll leave those alone.

If anyone has objections to any of the above, let me know.

> Tika CLI -t should use UTF-8 as default output encoding
> -------------------------------------------------------
>
>                 Key: TIKA-3515
>                 URL: https://issues.apache.org/jira/browse/TIKA-3515
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.27, 2.0.0-BETA
>         Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302
>            Reporter: Luís Filipe Nassif
>            Priority: Minor
>         Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf, 
> LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt, 
> LIVE-Seoul-ntfs-utf-8.txt, LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml, 
> LIVE-Seoul-ntfs-utf-8_-t_output.txt, Screen Shot 2021-08-06 at 5.50.04 
> PM.png, Screen Shot 2021-08-06 at 5.50.21 PM.png, Screen Shot tika-app.png, 
> image-2021-08-09-14-37-30-552.png, image-2021-08-09-14-38-26-763.png
>
>
> Some Korean characters are extracted as squares. The encodings of the plain 
> text files are detected correctly. Maybe this is related to the content 
> handler (just a guess). I'll attach the triggering files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
