On Mon, 12 Feb 2007, Bram Kuijvenhoven wrote:
> Hi Michael, > Thanks for your reply! > > I have done a little more research; details are below. > > Michael Van Canneyt wrote: > > On Mon, 12 Feb 2007, Bram Kuijvenhoven wrote: > > > The Base64 decoding stream in the Base64 unit in the FCL appears to be > > > broken. > > > In particular > > > - it does not handle whitespace (or other characters not from the bas64 > > > alphabet), and > > > > It was not designed to do this. > > There is code trying to deal with bytes decoded as 99 (a sentinel value), but > it is not really functional. > > Looking more detailed at section 2.3 of RFC3548 > (http://www.ietf.org/rfc/rfc3548.txt), it is indeed stated that > > 'Implementations MUST reject the encoding if it contains characters > outside the base alphabet when interpreting base encoded data, unless > the specification referring to this document explicitly states > otherwise. Such specifications may, as MIME does, instead state that > characters outside the base encoding alphabet should simply be > ignored when interpreting data ("be liberal in what you accept"). > Note that this means that any CRLF constitute "non alphabet > characters" and are ignored.' > > So the current design is not wrong (from the perspective of RFC3548 alone); > only in the current implementation it not actually rejects these chars (e.g. > it does not raise an exception). > > RFC2045, about MIME, mentioned at a comment at the top in base64.pp, states > however that these chars must be ignored instead of rejected. I recommend to > support this 'mode' as well, as it allows line breaks, which is quite useful > in textual environments. Maybe you can add a property 'StrictRFC' or so ? > > > > - Read does not return the correct number of bytes read when at the end of > > > the > > > stream, and > > > - the meaning of the EOF is a little bit unclear > > > > > > These bugs can be circumvented by filtering out whitespace first and > > > calling > > > Size [which works!] to determine the actual stream size, but this is of > > > course > > > ugly, slower and not working on non-seekable streams. > > > > The non-seekable streams should remain supported. > > The point is that it currently only supports seekable streams (when calling > GetSize). For Read, it doesn't need seeking of course, and this will remain > so. > > There is not only choice in ignoring/rejecting non-base alphabet characters, > but also in dealing with '=' pad characters that are not at the end of the > string. (Up to two of such '=' pad characters have to be used at the end to > complete the last 4-byte sequence; this last sequence will encode 3 - nr. of > '='s at the end bytes instead of 3. (Normally, in Base64 encoding, sequences > of 4 bytes are used to encode 3 bytes from the input.)) > > In particular, the current 'getsize' calculation of TBase64DecodeStream (NB > this is done in the Seek method) seeks to the end of the input stream, reads > the last two characters, and determines from whether these are '=' or '==' by > how much the 'div 4 * 3' calculated stream size should be decreased. If we > signal an EOF at the first occurence of a '=' (besides the end of the input > stream), this calculation can go very wrong. The best behavior would be to > raise an exception here I think (when following RFC3548. I also think an exception is best. > > If I understand it correctly, input strings should also always have a byte > length that is a multiple of 4. > > I propose to let TBase64DecodeStream have two 'modes': > - 'strict mode': follows RFC3548, rejects any characters outside of base64 > alphabet, only accepts up to two '=' characters at the end, and requires the > input to have a Size being a multiple of 4; otherwise raises an > EBase64DecodeException OK, ignore my proposal for the property earlier. I didn't see this :-) > - 'MIME mode': follows RFC2045, ignores any characters outside of base64 > alphabet, takes any '=' as end of string, handles apparently truncated input > streams gracefully > Also, I'd tend to make MIME mode the default, but I leave that up to you (core > devs) to decide. No, that's fine, as it'll probably be the most used mode > > In code: > > TBase64DecodeMode = (bdmStrict, bdmMIME); > ... > TBase64DecodeStream = class(TSteam) > ... > property Mode:TBase64DecodeMode ... Excellent ! > > Note that in MIME mode, GetSize will Read the entire stream, whereas Strict > mode allows the calculation currently done in Seek. > > > If you test it, please try adding tests in fpcunit format. > > What would be a good starting point to learn about how to use fpcunit? (I > haven't used it before) The sources and examples :-) I once wrote an article about it (for beginners) if you want I can send it to you. Michael. _______________________________________________ fpc-devel maillist - [email protected] http://lists.freepascal.org/mailman/listinfo/fpc-devel
