[
https://issues.apache.org/jira/browse/DRILL-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601758#comment-17601758
]
ASF GitHub Bot commented on DRILL-8301:
---------------------------------------
pjfanning commented on PR #2637:
URL: https://github.com/apache/drill/pull/2637#issuecomment-1240555174
With Jackson: the JSON spec (https://www.ietf.org/rfc/rfc4627.txt) mandates
Unicode, with UTF-8 as the default encoding. XML also defaults to UTF-8 when no
encoding is declared. In my experience it is quite rare to see other Unicode
encodings used. UTF-8 should also use fewer bytes than UTF-16 for
Latin-alphabet text and numeric data.
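
As a rough illustration of the size point, here is a minimal sketch that encodes the same string both ways and compares byte counts. The class and method names are hypothetical and not part of Drill; UTF_16LE is used so that no byte order mark is added to the UTF-16 count.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not Drill code).
public class EncodedSizeSketch {

    // Number of bytes needed to encode s as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to encode s as UTF-16 (little endian, no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String sample = "SELECT 12345 FROM t";
        System.out.println("UTF-8 bytes:  " + utf8Length(sample));
        System.out.println("UTF-16 bytes: " + utf16Length(sample));
    }
}
```

For ASCII-only text, UTF-8 needs one byte per character and UTF-16 needs two. The gap narrows or reverses for text dominated by characters outside the Latin range.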
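Java strings are UTF-16 at the API level; since Java 9 (compact strings) the
JVM can store Latin-1-only strings internally using one byte per character.
I'm not sure if there is a performance impact from using UTF-16 instead of UTF-8
(https://www.dariawan.com/tutorials/java/java-9-compact-string-and-string-new-methods/).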
My main concern is correctness and testability rather than performance.
Choosing one encoding for externally facing data and another for internal use
would introduce a lot of extra complexity, and possibly confusion about which
to choose in a given scenario. It could also lower performance, since you would
often need to convert between the two encodings.
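
To make the correctness and testability concern concrete, here is a minimal sketch of char-to-byte conversion with an explicit charset, compared against the no-argument `getBytes()` that falls back to the JVM default. The class and method names are hypothetical and are not taken from the Drill code base.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only (not Drill code).
public class ExplicitCharsetSketch {

    // Always encode with UTF-8, regardless of the JVM default charset.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Always decode with UTF-8, regardless of the JVM default charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape to avoid source-encoding issues

        byte[] explicit = encode(text);
        byte[] implicit = text.getBytes(); // uses the JVM default charset

        System.out.println("default charset: " + Charset.defaultCharset());
        // May print false on a JVM whose default charset is not UTF-8.
        System.out.println("same bytes as default: " + Arrays.equals(explicit, implicit));
        System.out.println("round trip ok: " + text.equals(decode(explicit)));
    }
}
```

With an explicit charset on both sides, the round trip gives the same result on every JVM, so a test written against it does not depend on how the machine is configured.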
> Standardise on UTF-8 encoding for char to byte (and vice versa) conversions
> ---------------------------------------------------------------------------
>
> Key: DRILL-8301
> URL: https://issues.apache.org/jira/browse/DRILL-8301
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: PJ Fanning
> Priority: Major
>
> Lots of Drill code uses UTF-8 explicitly. Lots more Drill code does not set
> an explicit encoding, which means it relies on the JVM default (which differs
> by JVM install).