[ 
https://issues.apache.org/jira/browse/DRILL-8301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601758#comment-17601758
 ] 

ASF GitHub Bot commented on DRILL-8301:
---------------------------------------

pjfanning commented on PR #2637:
URL: https://github.com/apache/drill/pull/2637#issuecomment-1240555174

   Regarding Jackson: the JSON spec (https://www.ietf.org/rfc/rfc4627.txt) 
mandates Unicode, with UTF-8 as the default. XML also defaults to UTF-8. In my 
experience it is quite rare to see other Unicode encodings used. UTF-8 should 
also use fewer bytes for text based on the Latin alphabet and for numeric data.
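
   As a rough illustration of the size point, here is a minimal sketch that 
counts the bytes produced by each encoding. The class and method names are 
made up for this example and are not Drill or Jackson code.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {
    // Bytes needed to encode the text as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to encode the text as UTF-16 (big-endian, no byte-order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Hypothetical sample: ASCII letters, digits and spaces only.
        String sample = "SELECT 12345 FROM t";
        System.out.println("UTF-8 bytes:  " + utf8Length(sample));
        System.out.println("UTF-16 bytes: " + utf16Length(sample));
    }
}
```

   For ASCII-only input like this, UTF-8 needs one byte per character and 
UTF-16 needs two, so the UTF-16 figure is double.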
   
   Java strings are UTF-16 internally, and since Java 9 compact strings store 
Latin-1-only text using one byte per character 
(https://www.dariawan.com/tutorials/java/java-9-compact-string-and-string-new-methods/). 
I'm not sure whether there is a performance impact from using UTF-16 instead 
of UTF-8.
   
   My main concern is correctness and testability rather than performance. 
Choosing one encoding for externally facing data and another for internal use 
would introduce a lot of extra complexity, and possibly confusion about which 
to choose in a given scenario. It could also lower performance, as you would 
often need to convert between the two encodings.




> Standardise on UTF-8 encoding for char to byte (and vice versa) conversions
> ---------------------------------------------------------------------------
>
>                 Key: DRILL-8301
>                 URL: https://issues.apache.org/jira/browse/DRILL-8301
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: PJ Fanning
>            Priority: Major
>
> Lots of Drill code uses UTF-8 explicitly. Much more Drill code does not set 
> an explicit encoding, which means it relies on the JVM default (which can 
> differ between JVM installs).
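
To make the distinction in the issue description concrete, here is a minimal 
sketch contrasting a conversion that relies on the JVM default charset with 
one that names UTF-8 explicitly. The class and method names are hypothetical 
and are not taken from the Drill code base.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Relies on the JVM default charset, so the resulting bytes can differ
    // from one JVM install to another.
    static byte[] encodeWithDefault(String text) {
        return text.getBytes(Charset.defaultCharset());
    }

    // Names the charset explicitly, so the resulting bytes are the same everywhere.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes with the same explicit charset that was used for encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is written with a Unicode escape so that this source file
        // does not itself depend on the compiler's source encoding.
        String text = "caf\u00e9";
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default bytes:   " + encodeWithDefault(text).length);
        System.out.println("UTF-8 bytes:     " + encodeUtf8(text).length);
        System.out.println("round trip ok:   " + text.equals(decodeUtf8(encodeUtf8(text))));
    }
}
```

The explicit variant gives the same bytes on every machine, which is what 
makes the behaviour testable. The default-charset variant only matches it 
when the JVM default happens to be UTF-8.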



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
