aherbert commented on issue #46:
URL: https://github.com/apache/commons-codec/pull/46#issuecomment-617406675
My aim was not to dishearten, it was to try and maintain a common interface
and functionality in the different instances of BaseNCodec.
The 72 character line wrap was an example. This is an option specified in
the constructor for Base32/64. The default for URL safe encoding in Base64 is
76 chars. The default is 0 for no line wrapping.
I agree that whitespace support is not required. But it is also not
unexpected given how Base32 and Base64 function. The use of a decode table to
translate the bytes will have a similar runtime to your use of Hex. Each
character must be checked to see it is valid. At the business end Hex does this
on each character:
```java
final int digit = Character.digit(ch, 16);
if (digit == -1) {
throw new DecoderException("Illegal hexadecimal character " + ch + " at
index " + index);
}
return digit;
```
I am suggesting we skip conversion of input byte[] to char[] and the use of
Hex. Just do the decoding direct on the byte[] using a decode table.
Regarding line wrapping it is an expected feature of Base64 encoded data. As
as you say there are standard formats used where this is done. The support has
been carried through to Base32. I do not know if this has precedence or it was
done because it could be. I was suggesting doing the same for Base16. If you
encode data and write it somewhere it may be nice to line wrap this for
convenience when viewing it. Supporting a line wrap when encoding is thus for
human readable consumption. I find it more likely to use line wrapping on the
encoded output than to use a character set that is anything other than UTF-8
(or more specifically US_ASCII).
But what about decoding? You have to support it if you are expecting the
input to be line wrapped. But perhaps you are not. So what to do when the input
is not a valid hex character during decoding is a separate issue. There is a
case for adding support to BaseNCodec to throw an exception when the input byte
is not a valid character in the alphabet. Currently they are ignored to allow
whitespace padding. But this does of course allow you to put in junk and this
is also ignored.
Then we have 3 cases for a non-alphabet character during decoding:
- Ignore
- Detect if whitespace/padding and ignore; otherwise throw an exception
- Throw an exception
Currently Base16 would throw an exception. If you passed the exact same data
to Base32/64 they would decode it as it is a valid alphabet (although they may
throw an exception if using strict decoding depending on the trailing
characters). This difference of behaviour is odd. The simplest option for
Base16 is to do what Base32/64 do and ignore the invalid character.
Cleaning up exception functionality should be part of a separate fix. To me
it seems out of place to allow whitespace when decoding Base32/64 but not
Base16. In addition you can encode to Base32/64 with whitespace but not with
Base16.
I would suggest fixing Base16 to work as per Base32/64 in this PR. Then
fixing BaseNCodec to have an option flag to throw an exception when decoding if
an invalid character is encountered. This functionality would be off by default
for backward compatibility in Base32/64. The flag could be carried through to
the BaseNCodecInput/OutputStream allowing you to optionally throw exceptions
during streaming decoding when invalid characters are present.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]