Hi Xavier,

Thank you. I am still unfamiliar with that part of Dillo, so please let
me know about any progress.

For now I'm still writing the RFC and doing some proof of concepts. I can see that it will take a while.

Limiting ourselves to a-z, A-Z and 0-9 would still account for 62 out of
the 75 possible characters, so roughly 82% of the set. I think that
removing the quoting in favour of the limited set reduce the risk for
broken implementations, yet still provide a good amount of randomness.

Yes, I think so too.

I am not sure whether this was an intentional modification from your
side. My patch is adding a <space> as defined by POSIX.1-2017 [1], so
that sizeof " " would always return 2. Was it your intention to flag
this potential confusion?

Yes, my point was that it is that is not easy to determine the length of a UTF-8 string by just looking at it.

Also, there was not strict reason to use sizeof " ". Any other character would do e.g.: sizeof "x", sizeof "A", etc.

Same problem:

แ• ๐“ ๐™ญ ั… ๐ฑ ๐—‘ โคซ ๐‘ฅ ๐˜… โคฌ แ™ฎ โจฏ ๐•ฉ ๐– ร— ๐” ๐šก x ๏ฝ˜ โ…น ๐”ต แ•ฝ ๐’™ ๐˜น

๐™ฐ A ๐ด แ—… ๐‘จ ๐šจ ๐•ฌ ๐–  ะ ฮ‘ ๐›ข ๐– ๐’œ ๐œœ ๐–ฝ€ ๐“ ๐ž ๏ผก ๐—” ๐˜ˆ แŽช ๊“ฎ ๐€ ๐”„ ๐”ธ ๐Š  ๐˜ผ

Check: https://util.unicode.org/UnicodeJsps/confusables.jsp

It is generally safer to use the explicit length.

You can also use dStr_append_c() to only append one character, so you only need a single character.

That would be an unnecessary use of the heap, because the size is static.

Not sure what you mean with "unnecessary use of the heap". I meant
something like this:

  static void generate_boundary(Dstr *boundary)
  {
     for (int i = 0; i < 70; i++) {
        /* Extracted from RFC 2046, section 5.1.1. */
        static const char set[] = "abcdefghijklmnopqrstuvwxyz"
           "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           "0123456789";
        int c;

        do {
           c = rand() & 0xff;
        } while (!strchr(set, c));

        dStr_append_c(boundary, c);
     }
  }

The variable c is stored in a register, as you can see in the
disassembly (built with -Og):

hop% r2 build/src/dillo
WARN: Relocs has not been applied. Please use `-e bin.relocs.apply=true` or `-e 
bin.cache=true` next time
[0x000e3860]> s sym.generate_boundary_Dstr_
[0x0012cf99]> af
[0x0012cf99]> pdf
โ”Œ 85: sym.generate_boundary_Dstr_ (int64_t arg1);
โ”‚           ; arg int64_t arg1 @ rdi
โ”‚           0x0012cf99      55             push rbp                    ; 
Fl_Pixmap.H:1250 ; generate_boundary(Dstr*)
โ”‚           0x0012cf9a      4889e5         mov rbp, rsp
โ”‚           0x0012cf9d      4155           push r13
โ”‚           0x0012cf9f      4154           push r12
โ”‚           0x0012cfa1      53             push rbx
โ”‚           0x0012cfa2      4883ec08       sub rsp, 8
โ”‚           0x0012cfa6      4989fd         mov r13, rdi                ; arg1
โ”‚           0x0012cfa9      41bc00000000   mov r12d, 0                 ; 
Fl_Pixmap.H:1251
โ”‚       โ”Œโ”€< 0x0012cfaf      eb2c           jmp 0x12cfdd
โ”‚     โ”Œโ”Œโ”€โ”€> 0x0012cfb1      ff1509cd1e00   call qword [reloc.rand]     ; 
Fl_Pixmap.H:1253 ; [0x319cc0:8]=0
โ”‚     โ•Žโ•Žโ”‚   0x0012cfb7      0fb6d8         movzx ebx, al               ; <--- Perform the 
"& 0xff" by zero-extending the lowest byte in eax
โ”‚     โ•Žโ•Žโ”‚   0x0012cfba      89de           mov esi, ebx                ; 
Fl_Pixmap.H:1260
โ”‚     โ•Žโ•Žโ”‚   0x0012cfbc      488d3dbd5b..   lea rdi, obj.generate_boundary_Dstr_::set ; 
0x2a2b80 ; "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
โ”‚     โ•Žโ•Žโ”‚   0x0012cfc3      ff15afc51e00   call qword [reloc.strchr]   ; 
[0x319578:8]=0
โ”‚     โ•Žโ•Žโ”‚   0x0012cfc9      4885c0         test rax, rax
โ”‚     โ””โ”€โ”€โ”€< 0x0012cfcc      74e3           je 0x12cfb1
โ”‚      โ•Žโ”‚   0x0012cfce      89de           mov esi, ebx                ; 
Fl_Pixmap.H:1262
โ”‚      โ•Žโ”‚   0x0012cfd0      4c89ef         mov rdi, r13
โ”‚      โ•Žโ”‚   0x0012cfd3      67e830570800   call sym.dStr_append_c
โ”‚      โ•Žโ”‚   0x0012cfd9      4183c401       add r12d, 1                 ; 
Fl_Pixmap.H:1251
โ”‚      โ•Žโ”‚   ; CODE XREF from generate_boundary(Dstr*) @ 0x12cfaf(x)
โ”‚      โ•Žโ””โ”€> 0x0012cfdd      4183fc45       cmp r12d, 0x45              ; 'E'
โ”‚      โ””โ”€โ”€< 0x0012cfe1      7ece           jle 0x12cfb1
โ”‚           0x0012cfe3      4883c408       add rsp, 8                  ; 
Fl_Pixmap.H:1264
โ”‚           0x0012cfe7      5b             pop rbx
โ”‚           0x0012cfe8      415c           pop r12
โ”‚           0x0012cfea      415d           pop r13
โ”‚           0x0012cfec      5d             pop rbp
โ””           0x0012cfed      c3             ret

This method gives us a perfect uniform distribution when RAND_MAX is a
power of 2, at the cost of executing rand() more times than required
(70/(62/256) โ‰ˆ 289 on average). This other method has a slightly not
uniform distribution, but only runs rand() 70 times:

  static void generate_boundary(Dstr *boundary)
  {
     /* Extracted from RFC 2046, section 5.1.1. */
     static const char set[] = "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";
     static const int n = strlen(set);

     for (int i = 0; i < 70; i++) {
        int c = (unsigned char) set[rand() % n];
        dStr_append_c(boundary, c);
     }
  }

Which is probably okay for this case.

Notice I haven't tested any of these methods yet, I'll need to add a
form upload test case to be able to see them in action (or a unit test).

If we only use alphanumeric characters, we can just use isalnum() right?

According to POSIX.1-2017 [2], isalnum(3) depends on the current locale
configured by the system. For example, characters such as ร„ or รบ could
return non-zero.

Oh right, for some reason I was thinking this was the other way around,
and isalnum only worked with ASCII. I think we should review the other
uses of isalnum and friends as I think there may be used under similar
assumptions.

To avoid this, there are two possible solutions:

1. Use isalnum_l(3) to specify a locale_t object corresponding to the
"POSIX" locale (equivalent to "C" [3]), which must be previously
allocated by the newlocale(3) function [3] and released by the
freelocal(3) function [4]. A minimalist example is shown below:

       locale_t l = newlocale(LC_CTYPE, "POSIX", NULL);

       for (unsigned char i = 0; i < 255; i++)
               printf("hhu=%hhu, c=%c, isalnum=%d\n", i, i,
isalnum_l(i, l));

       freelocale(l);

2. Define a known subset from the portable character set defined by
POSIX.1-2017 [5] and use strspn(3), as already suggested by the patch.
IMHO this approach is better because:
        - It does not deal with locales, so developers not familiar with them
would understand the code better.
        - It is also portable outside a POSIX environment (not sure if this a
requirement, though).
        - It does not require dynamic allication via newlocale(3).
        - It is the only possible option if non-alnum characters, such as ':'
or '/', are appended to the boundary string.

Then, I think using our own set may be the most readable solution here,
which also allows the quick module method.

Best,
Rodrigo
_______________________________________________
Dillo-dev mailing list -- dillo-dev@mailman3.com
To unsubscribe send an email to dillo-dev-le...@mailman3.com

Reply via email to