Re: Please add base-64 URL-safe alphabet

Amos Jeffries Wed, 28 Jan 2015 19:32:31 -0800

On 13/12/2014 7:49 a.m., Niels Möller wrote:
> Amos Jeffries <[email protected]> writes:
> 
>> The attached patch implements a proposed API/ABI extension adding
>> support for RFC 4648 section 5 "Base 64 Encoding with URL and Filename
>> Safe Alphabet"
> 
> Thanks. It makes sense to me to add support for this in one way or the
> other.
> 
>> External code simply calls the init() function relevant to the
>> alphabet it is needing to encode/decode with. The library internally
>> uses the context to select which lookup table to use for later base64
>> function calls.
> 
> Passing the alphabet to the init function and recording it in the context
> is nice in some ways, but it breaks the ABI (since the size of the
> context struct is part of the ABI). Now, maybe we have to break the ABI
> for 3.1 anyway, to introduce versioned symbols. I don't yet know for
> sure if an soname change is necessary or not.
> 
> The alternative to storing this in the context, is to introduce some
> general encoding/decoding functions which support arbitrary alphabets and
> take the needed table as an additional argument. Then base64_* and
> base64url_* would be simple wrapper functions passing the right table.
> 
>> The base64_encode_raw() and base64_encode_group() functions which do
>> not use contexts are left untouched for now.
> 
> Both functions are undocumented, so we can consider changing them if
> neeed. As for encode_group, that seems to be used only in the openpgp code
> which is not really in a working state. So we can probably ignore that one.
> 
> But base64_encode_raw is used by the main encoding function,
> base64_encode_update. So to support other alphabets one either needs to
> pass some kind of alphabet argument to base64_encode_raw, or introduce a
> new function with such an argument.
> 
> Does anyone know if applications are using base64_encode_raw, despite
> it's status as undocumented?
> 
>> +/* which alphabet to use */
>> +#define BASE64_ALPHABET 0
>> +#define BASE64URL_ALPHABET 1
> 
> I think I'd prefer to avoid an enumeration of available alphabets. And
> in the API you sketch, applications will never use these constants,
> right?


Yes.

> 
>>  /* Base64 encoding */
>>  
>>  /* Maximum length of output for base64_encode_update. NOTE: Doesn't
>> @@ -73,11 +79,17 @@ struct base64_encode_ctx
>>  {
>>    unsigned word;   /* Leftover bits */
>>    unsigned bits;  /* Number of bits, always 0, 2, or 4. */
>> +  unsigned alphabet; /* which alphabet to use for encoding */
>>  };
> 
> Assuming we go with adding stuff to the context struct, I'd prefer to
> add a pointer directly to the alphabet (and for decoding, a pointer
> directly to the lookup table). So we don't need any switch on an
> alphabet enum.
> 
>> diff --git a/base64-decode.c b/base64-decode.c
>> index f622baa..fbaf54f 100644
>> --- a/base64-decode.c
>> +++ b/base64-decode.c
>> @@ -43,7 +43,7 @@
>>  #define TABLE_END -3
>>  
>>  static const signed char
>> -decode_table[0x100] =
>> +default_decode_table[0x100] =
>>  {
>>    /* White space is HT, VT, FF, CR, LF and SPC */
>>    -1, -1, -1, -1, -1, -1, -1, -1, -1, -2, -2, -2, -2, -2, -1, -1, 
>> @@ -64,10 +64,40 @@ decode_table[0x100] =
>>    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>>  };
>>  
>> +static const signed char
>> +urlextended_decode_table[0x100] =
>> +{
>> +  /* White space is HT, VT, FF, CR, LF and SPC */
>> +  -1, -1, -1, -1, -1, -1, -1, -1, -1, -2, -2, -2, -2, -2, -1, -1,
>> +  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>> +  -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 62, -1, -1,
>> +  52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1, -3, -1, -1,
>> +  -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,
>> +  15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, 63,
>> +  -1, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
>> +  41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, -1, -1, -1, -1, -1,
>> +  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>> +  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>> +  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>> +  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>> +  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>> +  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>> +  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>> +  -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
>> +};
>> +
> 
> I think the base64url decoding functions, and its table, should be put
> in its own object file. Maybe static linking is obscure these days, but
> it's desirable that if one uses only the standard base64 alphabet, one
> shouldn't get dependencies causing the linker to drag in tables for
> other alphabets.
> 

Are you sure you want that?

Except for the init function and table the remainder of the code is all
shared. Separating the tables from the code means we would end up with
+4 files just holding lookup tables and have to move the table
symbols/pointers into a .h.


>>  void
>>  base64_decode_init(struct base64_decode_ctx *ctx)
>>  {
>>    ctx->word = ctx->bits = ctx->padding = 0;
>> +  ctx->alphabet = BASE64_ALPHABET;
>> +}
>> +
>> +void
>> +base64url_decode_init(struct base64_decode_ctx *ctx)
>> +{
>> +  ctx->word = ctx->bits = ctx->padding = 0;
>> +  ctx->alphabet = BASE64URL_ALPHABET;
>>  }
> 
> And then the assignment of ctx->alphabet would be
> 
>   ctx->table = base64_decode_table;
> 
> and
> 
>   ctx->table = base64url_decode_table;
> 
> in the respective init functions.
> 
>>  int
>> @@ -76,8 +106,11 @@ base64_decode_single(struct base64_decode_ctx *ctx,
>>                   uint8_t src)
>>  {
>>    int data;
>> -  
>> -  data = decode_table[src];
>> +
>> +  if (ctx->alphabet == BASE64URL_ALPHABET)
>> +    data = urlextended_decode_table[src];
>> +  else
>> +    data = default_decode_table[src];
> 
> And this piece of code would then simplify to
> 
>   data = ctx->table[src];
> 
>> @@ -155,6 +171,7 @@ base64_encode_single(struct base64_encode_ctx *ctx,
>>    unsigned done = 0;
>>    unsigned word = ctx->word << 8 | src;
>>    unsigned bits = ctx->bits + 8;
>> +  const uint8_t *encode_table = (ctx->alphabet == BASE64URL_ALPHABET ? 
>> urlextended_encode_table : default_encode_table);
>>    
>>    while (bits >= 6)
>>      {
> 
> I think this works, but only because the ENCODE macro which refers to
> the name "encode_table" now gets a local variable, instead of the global
> table in the old code. At a casual look, the local variable appears
> unused. I think it's easier to understand if this dependency where made
> more explicit.

Would you like that done as code documentation, manually unrolling
ENCODE, or replacing it with a static function?

> 
> Finally, are there other alphabets that are in use? You have
> mentioned, earlier, the alphabet used for crypt(3) (where is that
> documented???), and iirc, SRP files also use some unusual alphabet for
> base64.
> 
> Maybe we should make it reasonably easy to let applications supply
> custom alphabets? This should be almost trivial for encoding, while for
> decoding, we would either need to document the format of the lookup
> table, or provide a function to generate a decoding table for a given
> alphabet.

When the alphabet pointer is stored in the context the caller is able to
change that however and whenever they wish. For my first submission I
thought that was slightly too dangerous to propose.

But if you are happy to do it, then it resolves both the enumeration
existence and the ability for callers to use custom alphabets.


Amos
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs

Re: Please add base-64 URL-safe alphabet

Reply via email to