[
https://issues.apache.org/jira/browse/DAFFODIL-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Lawrence updated DAFFODIL-2561:
-------------------------------------
Description:
Comment from [~interran] in a pull request:
{quote}I reviewed how we call getBytes in Daffodil in order to check for
inconsistencies and best practices. I noticed two things: 1) we call
getBytes("ascii") instead in every other place where we want bytes from ASCII
characters; and 2) we call getBytes without a charset name too many times.
Java's platform default charset is specific to the user and OS. On many modern
Linux systems, it's UTF-8. On Macs, it has historically been MacRoman. On
Windows in the US and Western Europe, it's often CP1252, in Central Europe it's
CP1250, and in China it's often a GB* encoding for simplified Chinese (Big5 is
the traditional Chinese counterpart). I'm agnostic whether we use "ascii",
"US-ASCII", or import java.nio.charset.StandardCharsets and use
StandardCharsets.US_ASCII (I see Daffodil most often uses all-lowercase strings
to save space and typing), but we probably should create a bug to replace all
parameter-less getBytes calls with getBytes("utf-8").
{quote}
I *think* most/all of our uses of getBytes that don't provide an encoding are
in tests. But even if it doesn't affect the Daffodil source, it does make our
tests fragile to a user's encoding, and we are not consistent at all. We should
fix this so all uses provide an encoding, and our encodings are consistent.
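As a minimal sketch of why a parameter-less getBytes is fragile, the same
string encodes to different bytes under different charsets, so a result that
depends on the platform default can change from machine to machine. The class
and method names below are hypothetical and are not taken from Daffodil:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration only; not Daffodil code.
class ExplicitGetBytes {
    // Encode with an explicit charset so the result never depends on the
    // platform default charset.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Same string, different explicit charset, different bytes.
    static byte[] latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "cafe" with an acute accent on the e
        System.out.println(Arrays.toString(utf8Bytes(s)));   // [99, 97, 102, -61, -87]
        System.out.println(Arrays.toString(latin1Bytes(s))); // [99, 97, 102, -23]
    }
}
```

Using a StandardCharsets constant rather than a charset name string also
avoids the checked UnsupportedEncodingException that getBytes(String) declares.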
Additionally, the String class has a constructor that accepts a byte array and
an optional encoding. The same issue occurs if one does not provide an
encoding. We should find all uses of this constructor and ensure they use an
encoding.
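The decoding side can be sketched the same way. Assuming the bytes were
produced as UTF-8, decoding them with any other charset silently yields
different text, which is what can happen when new String(bytes) falls back to
the platform default. The names below are again hypothetical:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not Daffodil code.
class ExplicitStringDecode {
    // Decode with the charset the bytes were actually written in.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decoding the same bytes with a mismatched charset does not fail;
    // it just produces different characters.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of "caf\u00e9" (5 bytes for 4 characters)
        byte[] bytes = {99, 97, 102, (byte) 0xC3, (byte) 0xA9};
        System.out.println(decodeUtf8(bytes).length());   // 4
        System.out.println(decodeLatin1(bytes).length()); // 5
    }
}
```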
was:
Comment from [~interran] in a pull request:
{quote}I reviewed how we call getBytes in Daffodil in order to check for
inconsistencies and best practices. I noticed two things: 1) we call
getBytes("ascii") instead in every other place where we want bytes from ASCII
characters; and 2) we call getBytes without a charset name too many times.
Java's platform default charset is specific to the user and OS. On many modern
Linux systems, it's UTF-8. On Macs, it has historically been MacRoman. On
Windows in the US and Western Europe, it's often CP1252, in Central Europe it's
CP1250, and in China it's often a GB* encoding for simplified Chinese (Big5 is
the traditional Chinese counterpart). I'm agnostic whether we use "ascii",
"US-ASCII", or import java.nio.charset.StandardCharsets and use
StandardCharsets.US_ASCII (I see Daffodil most often uses all-lowercase strings
to save space and typing), but we probably should create a bug to replace all
parameter-less getBytes calls with getBytes("utf-8").{quote}
I *think* most/all of our uses of getBytes that don't provide an encoding are
in tests. But even if it doesn't affect the Daffodil source, it does make our
tests fragile to a user's encoding, and we are not consistent at all. We should
fix this so all uses provide an encoding, and our encodings are consistent.
> Fix uses of getBytes without an encoding specified
> --------------------------------------------------
>
> Key: DAFFODIL-2561
> URL: https://issues.apache.org/jira/browse/DAFFODIL-2561
> Project: Daffodil
> Issue Type: Bug
> Components: Clean Ups
> Reporter: Steve Lawrence
> Priority: Major