On 11/12/2013 03:32 PM, Bill Shannon wrote:
This still seems like an inconsistent, and inconvenient, approach to me.

You've decided that some encoding errors (i.e., missing pad characters)
can be ignored.  You're willing to assume that the missing characters aren't
missing data but just missing padding.  But if you find a padding character
where you don't expect it you won't assume that the missing data is zero.

"missing pad characters" in theory is not an encoding errors. As the RFC 
suggested, the
use of padding in base64 data is not required or used. They mainly serve the 
purpose of
providing the indication of "end of the data". This is why the padding 
character(s) is not
required (optional) by our decoder at first place. However, if the padding 
character(s) is
present, they need to be correctly encoded, otherwise, it's a malformed base64 
stream.
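For example, the shipped decoders already accept absent padding while rejecting
a pad character in the wrong position (a small sketch; the class name is mine):

```java
import java.util.Base64;

// Sketch: input padding is optional, but a pad character where data is
// expected is a malformed stream.
class PaddingDemo {
    public static void main(String[] args) {
        byte[] padded   = Base64.getDecoder().decode("QQ==");  // one byte, 'A'
        byte[] unpadded = Base64.getDecoder().decode("QQ");    // same byte
        System.out.println(padded.length == 1 && padded[0] == unpadded[0]);

        try {
            Base64.getDecoder().decode("Q=");  // '=' after a single char
            System.out.println("accepted");
        } catch (IllegalArgumentException e) {
            System.out.println("malformed");   // this path is taken
        }
    }
}
```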

To address your request for a more "lenient" MIME decoder, we have updated the
spec and implementation to be reasonably liberal about incorrect padding at the
end of the MIME base64 data, as shown below:

     xxxx =       unnecessary padding character at the end of the encoded stream
     xxxx xx=     missing the last padding character
     xxxx xx=y    missing the last padding character, with a non-padding char in its place

On the assumption that this still follows the "spirit" of the padding
character's purpose (as suggested by the RFC), to indicate the end of the data
stream, no more decoding is done beyond the padding character. Yes, this makes
the MIME decoder somewhat "inconsistent" with our original design and with the
rest of the decoder types, but we thought it might provide the "convenience"
requested.

But a single dangling byte at the end of the encoded data stream is obviously an
encoding error or a transport error. As I said, I don't think the decoder should
try to rescue it by guessing. The proposed change is to provide a simple
mechanism by which the application can do some lifecycle/error management to
recover from the malformed data stream, if desired. This is actually NOT what
j.u.Base64 is designed for. The primary goal is to provide a set of easy/simple
utility methods for base64 encoding/decoding, not the kind of complicated
error-recovery management that java.nio.charset.CharsetDecoder/CharsetEncoder
provides.

The JavaDoc can definitely be improved with a detailed use case and sample, if
that helps. But if this is definitely a no-go, maybe we can leave it for jdk9
and bigger surgery.

-Sherman.


These errors, and others, could represent actual missing data that the
user of the library should be notified about.  As it is, you're not
implementing "strict" and you're not implementing "lenient", you're
implementing "half-lenient".  You might as well implement strict and
require the user of the library to always wrap it in a way that all
errors are ignored in order to provide "lenient".

That might be ok if it were easier to wrap the library to ignore errors.
Perhaps a FilterInputStream could be written that would ignore the errors?
Perhaps you could even provide such a stream?
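A sketch of such a wrapper (the class name is hypothetical, not an existing
API): it treats any decoding error as end-of-data, so a corrupt tail silently
ends the stream and the caller keeps whatever decoded cleanly.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Base64;

// Hypothetical wrapper: any IOException from the decoding stream is treated
// as end-of-data rather than surfaced to the caller.
class LenientBase64Stream extends FilterInputStream {
    LenientBase64Stream(InputStream encoded) {
        super(Base64.getMimeDecoder().wrap(encoded));
    }

    @Override
    public int read() {
        try {
            return super.read();
        } catch (IOException e) {
            return -1;  // malformed tail: keep what decoded cleanly, stop here
        }
    }

    @Override
    public int read(byte[] b, int off, int len) {
        try {
            return super.read(b, off, len);
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // "QUJDQ" is "ABC" plus one dangling base64 char
        LenientBase64Stream in = new LenientBase64Stream(
                new ByteArrayInputStream(new byte[] { 'Q', 'U', 'J', 'D', 'Q' }));
        StringBuilder sb = new StringBuilder();
        for (int c; (c = in.read()) != -1; ) {
            sb.append((char) c);
        }
        System.out.println(sb);  // the cleanly decoded prefix, "ABC"
    }
}
```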

But that won't help with the non-stream methods.  The ByteBuffer approach
is pretty clunky.  If you think it's reasonable, it would help a lot to
include it as an example in the javadocs.

The fundamental problem here is that there's really nothing reasonable to
do if the data is not properly encoded.  You either reflect the error to
the user and tell them they can't have the data because it was corrupted,
or you make a best effort to give them as much data as you can and let the
user decide if it was corrupted.  It's not as if someone is going to write
a program that looks at the corrupt data and "corrects" it based on how
it was corrupted.

Xueming Shen wrote on 11/12/13 14:39:
Hi Bill,

I'm still not convinced that the Base64.Decoder should guess the missing bits
(filling them with zeros) for a last dangling single byte in the input stream,
even in lenient mode. That said, I understand it might really be desirable to
still be able to decode such a malformed base64 byte stream in real-world use
scenarios. I am trying to find a solution that addresses the real issue without
compromising the integrity of the input data. There is an "advanced" decode
method, Base64.Decoder.decode(ByteBuffer, ByteBuffer); it is currently specified
and implemented as


          *<p>  If the input buffer is not in valid Base64 encoding scheme
          * then some bytes may have been written to the output buffer
          * before IllegalArgumentException is thrown. The positions of
          * both input and output buffer will not be advanced in this case.
          *

So, if the stream is malformed, the current implementation of the decode()
method throws an IllegalArgumentException and resets the input and output
buffers back to their original positions (throwing away the decoded bytes).
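The same input-buffer semantics can be seen with the single-buffer
decode(ByteBuffer) variant (a sketch; the class name is mine):

```java
import java.nio.ByteBuffer;
import java.util.Base64;

// Sketch: on malformed input, decode(ByteBuffer) throws IAE and the input
// buffer's position is rolled back; any partial result is discarded.
class MalformedBufferDemo {
    public static void main(String[] args) {
        // "QUJDQ" = "ABC" plus one dangling base64 char: malformed input
        ByteBuffer src = ByteBuffer.wrap("QUJDQ".getBytes());
        try {
            Base64.getDecoder().decode(src);
            System.out.println("decoded");
        } catch (IllegalArgumentException e) {
            // position not advanced on failure, per the spec
            System.out.println(src.position());  // 0
        }
    }
}
```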

It might be reasonable to change it to

          *<p>  The decoding operation will stop and return {@code -1} if
          * the input buffer is not in valid Base64 encoding scheme and
          * the malformed-input error has been detected. The malformed
          * bytes begin at the input buffer's current (possibly advanced)
          * position. The output buffer's position will be advanced to
          * reflect the bytes written so far.

which means that when there is a malformed byte sequence, instead of throwing
an IAE it now returns -1 "normally" (better flow control?) and leaves the
positions of the input and output buffers at the place where it stopped. So you
can recover the decoded result from the output buffer, and find out where
the malformed byte sequence starts (if desired).

     ByteBuffer src = ByteBuffer.wrap(src_byte_array);
     ByteBuffer dst = ByteBuffer.wrap(dst_byte_array);
     int ret = dec.decode(src, dst);
     dst.flip();
     // do something with the resulting bytes

     if (ret < 0) {
         // do something with the malformed bytes in src
     }

Instead of -1, the return value could be the negative of the number of bytes
written to the output buffer, if really needed, though the "position" and
"limit" of the ByteBuffer should provide enough information for access.

The error recovery mechanism appears to work perfectly for your use
scenario :-)  The "only" downside/inconvenience is that you will need to
wrap your input/output byte arrays in java.nio ByteBuffers (which are our
recommended replacement for byte[] + offset + length anyway).
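Wrapping is a one-liner; ByteBuffer.wrap(byte[], int, int) maps directly onto
the byte[] + offset + length idiom (a sketch; the class name is mine):

```java
import java.nio.ByteBuffer;

class WrapDemo {
    public static void main(String[] args) {
        byte[] buf = new byte[64];
        // equivalent of passing (buf, 8, 32): position = 8, limit = 8 + 32
        ByteBuffer bb = ByteBuffer.wrap(buf, 8, 32);
        System.out.println(bb.position() + " " + bb.limit());  // 8 40
    }
}
```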

http://cr.openjdk.java.net/~sherman/base64_malformed/webrev/

Opinion?

Thanks!
-Sherman


On 11/08/2013 02:35 PM, Bill Shannon wrote:
Have you had a chance to think about this?  Can the MIME decoder be made
more lenient, or can I get an option to control this?

Bill Shannon wrote on 10/25/13 15:24:
Xueming Shen wrote on 10/25/13 15:19:
On 10/25/13 2:19 PM, Bill Shannon wrote:
If I understand this correctly, this proposes to remove the "lenient"
option we've been discussing and just make it always lenient.  Is that
correct?
Yes. Only for the mime type though.
That's fine.

Unfortunately, from what you say below, it's still not lenient enough.
I'd really like a version that never, ever, for any reason, throws an
exception.  Yes, that means when you only get a final 6 bits of data
you have to make an assumption about what was intended, probably padding
it with zeros to 8 bits.
This is something I'm hesitant to do. I can be lenient about the padding at the
end because the padding character itself is not the real "data"; whether it's
present, missing, or incorrect/incomplete does not impact the integrity of the
data. But to pad the last 6 bits out with zeros is really a kind of guessing,
NOT decoding.
I understand.  And if the people who write spamming software knew how to
read a spec, we wouldn't have this problem!  :-)

Still, there's a lot of bad data out there on the internet, and people
want the software to do the best job it can to interpret the data.  It's
better to guess at the missing 2 bits of data than to lose the last 6 bits
of data.
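Concretely, the "guess" under discussion would look something like this sketch
(not JDK behavior, which refuses to do this; the class and method names are
mine): decode the complete 4-char units, then zero-fill the missing 2 bits of a
dangling final character to force out one more byte.

```java
import java.util.Base64;

class ZeroFillDemo {
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Hypothetical lenient decode: handle only the single-dangling-char case
    // by assuming the missing 2 low bits are zero.
    static byte[] decodeZeroFilled(String in) {
        int keep = in.length() - (in.length() % 4);
        byte[] head = Base64.getDecoder().decode(in.substring(0, keep));
        if (in.length() % 4 != 1) {
            return head;  // other remainders are already accepted as unpadded
        }
        int sixBits = ALPHABET.indexOf(in.charAt(keep));
        byte[] out = java.util.Arrays.copyOf(head, head.length + 1);
        out[head.length] = (byte) (sixBits << 2);  // missing 2 bits assumed zero
        return out;
    }

    public static void main(String[] args) {
        // "QUJDQ": "ABC" plus dangling 'Q' (value 16); 16 << 2 == 0x40 == '@'
        System.out.println(new String(decodeZeroFilled("QUJDQ")));  // ABC@
    }
}
```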
