[ 
https://issues.apache.org/jira/browse/DAFFODIL-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Lawrence updated DAFFODIL-2561:
-------------------------------------
    Description: 
Comment from [~interran] in a pull request:
{quote}I reviewed how we call getBytes in Daffodil in order to check for 
inconsistencies and best practices. I noticed two things: 1) we call 
getBytes("ascii") instead in every other place where we want bytes from ASCII 
characters; and 2) we call getBytes without a charset name too many times. 
Java's platform default charset is specific to the user and OS. On many modern 
Linux systems, it's UTF-8. On older Macs, it can be MacRoman. In the US and 
Western Europe on Windows, it's often CP1252, while in Central Europe it's 
CP1250, and in Chinese locales it's often Big5 or a GB* encoding. I'm agnostic 
whether we use "ascii", "US-ASCII", or import 
java.nio.charset.StandardCharsets and use StandardCharsets.US_ASCII (I see 
Daffodil typically uses all-lowercase strings to save space and typing), but 
we probably should create a bug to replace all parameter-less getBytes calls 
with getBytes("utf-8").
{quote}
I *think* most/all of our uses of getBytes that don't provide an encoding are 
in tests. But even if it doesn't affect the Daffodil source, it does make our 
tests fragile to a user's platform encoding, and we are not consistent at all. 
We should fix this so all uses provide an encoding, and our encodings are 
consistent.
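
As a rough illustration of the fix (a sketch only; the class and method names 
below are hypothetical and not taken from the Daffodil code base), passing a 
charset explicitly makes the resulting bytes independent of the platform 
default:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of Daffodil.
final class ExplicitCharsetBytes {

    // Encodes with an explicit charset, so the result is the same on every platform.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Characters outside US-ASCII are replaced with '?' (byte 63) by getBytes.
    static byte[] asciiBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // the last character is outside US-ASCII

        // Fragile: the result depends on the JVM's default charset.
        byte[] platformDependent = text.getBytes();

        // Stable: the charset is stated at the call site.
        byte[] utf8 = utf8Bytes(text);

        System.out.println("default: " + Arrays.toString(platformDependent));
        System.out.println("utf-8:   " + Arrays.toString(utf8));
    }
}
```

Using a StandardCharsets constant instead of a charset name string also avoids 
the checked UnsupportedEncodingException that getBytes(String) declares. Which 
spelling we standardize on is still an open choice.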

Additionally, the String class has a constructor that accepts a byte array and 
an optional encoding. The same issue occurs if one does not provide an 
encoding. We should find all uses of this constructor and ensure they provide 
an encoding.
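
The decoding side could look roughly like this (again a sketch under 
assumptions; ExplicitCharsetDecode and fromUtf8 are made-up names used only 
for illustration):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Daffodil.
final class ExplicitCharsetDecode {

    // Decodes with an explicit charset instead of the platform default.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of "caf" followed by U+00E9
        byte[] bytes = {99, 97, 102, (byte) 0xC3, (byte) 0xA9};

        // Fragile: new String(bytes) decodes with the JVM's default charset.
        String platformDependent = new String(bytes);

        // Stable: the same bytes always decode to the same string.
        String utf8 = fromUtf8(bytes);

        System.out.println("default: " + platformDependent);
        System.out.println("utf-8:   " + utf8);
    }
}
```

On a JVM whose default charset is not UTF-8, the first call can yield two 
characters in place of the final one, which is the kind of platform-dependent 
test failure this issue is about.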

  was:
Comment from [~interran] in a pull request:
{quote}I reviewed how we call getBytes in Daffodil in order to check for 
inconsistencies and best practices. I noticed two things: 1) we call 
getBytes("ascii") instead in every other place where we want bytes from ASCII 
characters; and 2) we call getBytes without a charset name too many times. 
Java's platform default charset is specific to the user and OS. On many modern 
Linux systems, it's UTF-8. On older Macs, it can be MacRoman. In the US and 
Western Europe on Windows, it's often CP1252, while in Central Europe it's 
CP1250, and in Chinese locales it's often Big5 or a GB* encoding. I'm agnostic 
whether we use "ascii", "US-ASCII", or import 
java.nio.charset.StandardCharsets and use StandardCharsets.US_ASCII (I see 
Daffodil typically uses all-lowercase strings to save space and typing), but 
we probably should create a bug to replace all parameter-less getBytes calls 
with getBytes("utf-8").{quote}

I *think* most/all of our uses of getBytes that don't provide an encoding are 
in tests. But even if it doesn't affect the Daffodil source, it does make our 
tests fragile to a user's platform encoding, and we are not consistent at all. 
We should fix this so all uses provide an encoding, and our encodings are 
consistent.


> Fix uses of getBytes without an encoding specified
> --------------------------------------------------
>
>                 Key: DAFFODIL-2561
>                 URL: https://issues.apache.org/jira/browse/DAFFODIL-2561
>             Project: Daffodil
>          Issue Type: Bug
>          Components: Clean Ups
>            Reporter: Steve Lawrence
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
