Hi Rodrigo,

> Not sure what you mean with "unnecessary use of the heap". I meant
> something like this: 

You are right: I was too fixated with strcspn(3), when in fact strchr(3)
would already do. Also, this use of dStr_append_c() looks good to me.

> The variable c is stored in a register, as you can see in the
> disassembly (built with -Og): 

Thanks for the detailed explanation. Anyway, I do not expect this
function to become a performance bottleneck.

> This other method has a slightly not
> uniform distribution, but only runs rand() 70 times: 
> 
> Which is probably okay for this case.

I also think it is good enough.

> Then, I think using our own set may be the most readable solution here,
> which also allows the quick module method. 

I agree.

Best regards,

Xavi

On 1/9/24 16:12, Rodrigo Arias wrote:
> Hi Xavier,
> 
>> Thank you. I am still unfamiliar with that part of Dillo, so please let
>> me know about any progress.
> 
> For now I'm still writing the RFC and doing some proof of concepts. I
> can see that it will take a while.
> 
>> Limiting ourselves to a-z, A-Z and 0-9 would still account for 62 out of
>> the 75 possible characters, so roughly 82% of the set. I think that
>> removing the quoting in favour of the limited set reduce the risk for
>> broken implementations, yet still provide a good amount of randomness.
> 
> Yes, I think so too.
> 
>> I am not sure whether this was an intentional modification from your
>> side. My patch is adding a <space> as defined by POSIX.1-2017 [1], so
>> that sizeof " " would always return 2. Was it your intention to flag
>> this potential confusion?
> 
> Yes, my point was that it is that is not easy to determine the length of
> a UTF-8 string by just looking at it.
> 
>> Also, there was not strict reason to use sizeof " ". Any other
>> character would do e.g.: sizeof "x", sizeof "A", etc.
> 
> Same problem:
> 
> ᕁ 𝓍 𝙭 х 𝐱 𝗑 ⤫ 𝑥 𝘅 ⤬ ᙮ ⨯ 𝕩 𝖝 × 𝔁 𝚡 x x ⅹ 𝔵 ᕽ 𝒙 𝘹
> 
> 𝙰 A 𝐴 ᗅ 𝑨 𝚨 𝕬 𝖠 А Α 𝛢 𝝖 𝒜 𝜜 𖽀 𝓐 𝞐 A 𝗔 𝘈 Ꭺ ꓮ 𝐀 𝔄 𝔸 𐊠 𝘼
> 
> Check: https://util.unicode.org/UnicodeJsps/confusables.jsp
> 
> It is generally safer to use the explicit length.
> 
>>> You can also use dStr_append_c() to only append one character, so you
>>> only need a single character.
>>
>> That would be an unnecessary use of the heap, because the size is static.
> 
> Not sure what you mean with "unnecessary use of the heap". I meant
> something like this:
> 
>   static void generate_boundary(Dstr *boundary)
>   {
>      for (int i = 0; i < 70; i++) {
>         /* Extracted from RFC 2046, section 5.1.1. */
>         static const char set[] = "abcdefghijklmnopqrstuvwxyz"
>            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>            "0123456789";
>         int c;
> 
>         do {
>            c = rand() & 0xff;
>         } while (!strchr(set, c));
> 
>         dStr_append_c(boundary, c);
>      }
>   }
> 
> The variable c is stored in a register, as you can see in the
> disassembly (built with -Og):
> 
> hop% r2 build/src/dillo
> WARN: Relocs has not been applied. Please use `-e bin.relocs.apply=true`
> or `-e bin.cache=true` next time
> [0x000e3860]> s sym.generate_boundary_Dstr_
> [0x0012cf99]> af
> [0x0012cf99]> pdf
> ┌ 85: sym.generate_boundary_Dstr_ (int64_t arg1);
> │           ; arg int64_t arg1 @ rdi
> │           0x0012cf99      55             push rbp                    ;
> Fl_Pixmap.H:1250 ; generate_boundary(Dstr*)
> │           0x0012cf9a      4889e5         mov rbp, rsp
> │           0x0012cf9d      4155           push r13
> │           0x0012cf9f      4154           push r12
> │           0x0012cfa1      53             push rbx
> │           0x0012cfa2      4883ec08       sub rsp, 8
> │           0x0012cfa6      4989fd         mov r13, rdi                ;
> arg1
> │           0x0012cfa9      41bc00000000   mov r12d, 0                 ;
> Fl_Pixmap.H:1251
> │       ┌─< 0x0012cfaf      eb2c           jmp 0x12cfdd
> │     ┌┌──> 0x0012cfb1      ff1509cd1e00   call qword [reloc.rand]     ;
> Fl_Pixmap.H:1253 ; [0x319cc0:8]=0
> │     ╎╎│   0x0012cfb7      0fb6d8         movzx ebx, al               ;
> <--- Perform the "& 0xff" by zero-extending the lowest byte in eax
> │     ╎╎│   0x0012cfba      89de           mov esi, ebx                ;
> Fl_Pixmap.H:1260
> │     ╎╎│   0x0012cfbc      488d3dbd5b..   lea rdi,
> obj.generate_boundary_Dstr_::set ; 0x2a2b80 ;
> "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
> │     ╎╎│   0x0012cfc3      ff15afc51e00   call qword [reloc.strchr]   ;
> [0x319578:8]=0
> │     ╎╎│   0x0012cfc9      4885c0         test rax, rax
> │     └───< 0x0012cfcc      74e3           je 0x12cfb1
> │      ╎│   0x0012cfce      89de           mov esi, ebx                ;
> Fl_Pixmap.H:1262
> │      ╎│   0x0012cfd0      4c89ef         mov rdi, r13
> │      ╎│   0x0012cfd3      67e830570800   call sym.dStr_append_c
> │      ╎│   0x0012cfd9      4183c401       add r12d, 1                 ;
> Fl_Pixmap.H:1251
> │      ╎│   ; CODE XREF from generate_boundary(Dstr*) @ 0x12cfaf(x)
> │      ╎└─> 0x0012cfdd      4183fc45       cmp r12d, 0x45              ;
> 'E'
> │      └──< 0x0012cfe1      7ece           jle 0x12cfb1
> │           0x0012cfe3      4883c408       add rsp, 8                  ;
> Fl_Pixmap.H:1264
> │           0x0012cfe7      5b             pop rbx
> │           0x0012cfe8      415c           pop r12
> │           0x0012cfea      415d           pop r13
> │           0x0012cfec      5d             pop rbp
> └           0x0012cfed      c3             ret
> 
> This method gives us a perfect uniform distribution when RAND_MAX is a
> power of 2, at the cost of executing rand() more times than required
> (70/(62/256) ≈ 289 on average). This other method has a slightly not
> uniform distribution, but only runs rand() 70 times:
> 
>   static void generate_boundary(Dstr *boundary)
>   {
>      /* Extracted from RFC 2046, section 5.1.1. */
>      static const char set[] = "abcdefghijklmnopqrstuvwxyz"
>         "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>         "0123456789";
>      static const int n = strlen(set);
> 
>      for (int i = 0; i < 70; i++) {
>         int c = (unsigned char) set[rand() % n];
>         dStr_append_c(boundary, c);
>      }
>   }
> 
> Which is probably okay for this case.
> 
> Notice I haven't tested any of these methods yet, I'll need to add a
> form upload test case to be able to see them in action (or a unit test).
> 
>>> If we only use alphanumeric characters, we can just use isalnum() right?
>>
>> According to POSIX.1-2017 [2], isalnum(3) depends on the current locale
>> configured by the system. For example, characters such as Ä or ú could
>> return non-zero.
> 
> Oh right, for some reason I was thinking this was the other way around,
> and isalnum only worked with ASCII. I think we should review the other
> uses of isalnum and friends as I think there may be used under similar
> assumptions.
> 
>> To avoid this, there are two possible solutions:
>>
>> 1. Use isalnum_l(3) to specify a locale_t object corresponding to the
>> "POSIX" locale (equivalent to "C" [3]), which must be previously
>> allocated by the newlocale(3) function [3] and released by the
>> freelocal(3) function [4]. A minimalist example is shown below:
>>
>>        locale_t l = newlocale(LC_CTYPE, "POSIX", NULL);
>>
>>        for (unsigned char i = 0; i < 255; i++)
>>                printf("hhu=%hhu, c=%c, isalnum=%d\n", i, i,
>> isalnum_l(i, l));
>>
>>        freelocale(l);
>>
>> 2. Define a known subset from the portable character set defined by
>> POSIX.1-2017 [5] and use strspn(3), as already suggested by the patch.
>> IMHO this approach is better because:
>>     - It does not deal with locales, so developers not familiar with them
>> would understand the code better.
>>     - It is also portable outside a POSIX environment (not sure if this a
>> requirement, though).
>>     - It does not require dynamic allication via newlocale(3).
>>     - It is the only possible option if non-alnum characters, such as ':'
>> or '/', are appended to the boundary string.
> 
> Then, I think using our own set may be the most readable solution here,
> which also allows the quick module method.
> 
> Best,
> Rodrigo
> _______________________________________________
> Dillo-dev mailing list -- dillo-dev@mailman3.com
> To unsubscribe send an email to dillo-dev-le...@mailman3.com

Attachment: OpenPGP_0x84FF3612A9BF43F2.asc
Description: OpenPGP public key

Attachment: OpenPGP_signature.asc
Description: OpenPGP digital signature

_______________________________________________
Dillo-dev mailing list -- dillo-dev@mailman3.com
To unsubscribe send an email to dillo-dev-le...@mailman3.com

Reply via email to