Hi Rodrigo, > Not sure what you mean with "unnecessary use of the heap". I meant > something like this:
You are right: I was too fixated with strcspn(3), when in fact strchr(3) would already do. Also, this use of dStr_append_c() looks good to me. > The variable c is stored in a register, as you can see in the > disassembly (built with -Og): Thanks for the detailed explanation. Anyway, I do not expect this function to become a performance bottleneck. > This other method has a slightly not > uniform distribution, but only runs rand() 70 times: > > Which is probably okay for this case. I also think it is good enough. > Then, I think using our own set may be the most readable solution here, > which also allows the quick module method. I agree. Best regards, Xavi On 1/9/24 16:12, Rodrigo Arias wrote: > Hi Xavier, > >> Thank you. I am still unfamiliar with that part of Dillo, so please let >> me know about any progress. > > For now I'm still writing the RFC and doing some proof of concepts. I > can see that it will take a while. > >> Limiting ourselves to a-z, A-Z and 0-9 would still account for 62 out of >> the 75 possible characters, so roughly 82% of the set. I think that >> removing the quoting in favour of the limited set reduce the risk for >> broken implementations, yet still provide a good amount of randomness. > > Yes, I think so too. > >> I am not sure whether this was an intentional modification from your >> side. My patch is adding a <space> as defined by POSIX.1-2017 [1], so >> that sizeof " " would always return 2. Was it your intention to flag >> this potential confusion? > > Yes, my point was that it is that is not easy to determine the length of > a UTF-8 string by just looking at it. > >> Also, there was not strict reason to use sizeof " ". Any other >> character would do e.g.: sizeof "x", sizeof "A", etc. > > Same problem: > > ᕁ 𝓍 𝙭 х 𝐱 𝗑 ⤫ 𝑥 𝘅 ⤬ ᙮ ⨯ 𝕩 𝖝 × 𝔁 𝚡 x x ⅹ 𝔵 ᕽ 𝒙 𝘹 > > 𝙰 A 𝐴 ᗅ 𝑨 𝚨 𝕬 𝖠 А Α 𝛢 𝝖 𝒜 𝜜 𖽀 𝓐 𝞐 A 𝗔 𝘈 Ꭺ ꓮ 𝐀 𝔄 𝔸 𐊠 𝘼 > > Check: https://util.unicode.org/UnicodeJsps/confusables.jsp > > It is generally safer to use the explicit length. > >>> You can also use dStr_append_c() to only append one character, so you >>> only need a single character. >> >> That would be an unnecessary use of the heap, because the size is static. > > Not sure what you mean with "unnecessary use of the heap". I meant > something like this: > > static void generate_boundary(Dstr *boundary) > { > for (int i = 0; i < 70; i++) { > /* Extracted from RFC 2046, section 5.1.1. */ > static const char set[] = "abcdefghijklmnopqrstuvwxyz" > "ABCDEFGHIJKLMNOPQRSTUVWXYZ" > "0123456789"; > int c; > > do { > c = rand() & 0xff; > } while (!strchr(set, c)); > > dStr_append_c(boundary, c); > } > } > > The variable c is stored in a register, as you can see in the > disassembly (built with -Og): > > hop% r2 build/src/dillo > WARN: Relocs has not been applied. Please use `-e bin.relocs.apply=true` > or `-e bin.cache=true` next time > [0x000e3860]> s sym.generate_boundary_Dstr_ > [0x0012cf99]> af > [0x0012cf99]> pdf > ┌ 85: sym.generate_boundary_Dstr_ (int64_t arg1); > │ ; arg int64_t arg1 @ rdi > │ 0x0012cf99 55 push rbp ; > Fl_Pixmap.H:1250 ; generate_boundary(Dstr*) > │ 0x0012cf9a 4889e5 mov rbp, rsp > │ 0x0012cf9d 4155 push r13 > │ 0x0012cf9f 4154 push r12 > │ 0x0012cfa1 53 push rbx > │ 0x0012cfa2 4883ec08 sub rsp, 8 > │ 0x0012cfa6 4989fd mov r13, rdi ; > arg1 > │ 0x0012cfa9 41bc00000000 mov r12d, 0 ; > Fl_Pixmap.H:1251 > │ ┌─< 0x0012cfaf eb2c jmp 0x12cfdd > │ ┌┌──> 0x0012cfb1 ff1509cd1e00 call qword [reloc.rand] ; > Fl_Pixmap.H:1253 ; [0x319cc0:8]=0 > │ ╎╎│ 0x0012cfb7 0fb6d8 movzx ebx, al ; > <--- Perform the "& 0xff" by zero-extending the lowest byte in eax > │ ╎╎│ 0x0012cfba 89de mov esi, ebx ; > Fl_Pixmap.H:1260 > │ ╎╎│ 0x0012cfbc 488d3dbd5b.. lea rdi, > obj.generate_boundary_Dstr_::set ; 0x2a2b80 ; > "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789" > │ ╎╎│ 0x0012cfc3 ff15afc51e00 call qword [reloc.strchr] ; > [0x319578:8]=0 > │ ╎╎│ 0x0012cfc9 4885c0 test rax, rax > │ └───< 0x0012cfcc 74e3 je 0x12cfb1 > │ ╎│ 0x0012cfce 89de mov esi, ebx ; > Fl_Pixmap.H:1262 > │ ╎│ 0x0012cfd0 4c89ef mov rdi, r13 > │ ╎│ 0x0012cfd3 67e830570800 call sym.dStr_append_c > │ ╎│ 0x0012cfd9 4183c401 add r12d, 1 ; > Fl_Pixmap.H:1251 > │ ╎│ ; CODE XREF from generate_boundary(Dstr*) @ 0x12cfaf(x) > │ ╎└─> 0x0012cfdd 4183fc45 cmp r12d, 0x45 ; > 'E' > │ └──< 0x0012cfe1 7ece jle 0x12cfb1 > │ 0x0012cfe3 4883c408 add rsp, 8 ; > Fl_Pixmap.H:1264 > │ 0x0012cfe7 5b pop rbx > │ 0x0012cfe8 415c pop r12 > │ 0x0012cfea 415d pop r13 > │ 0x0012cfec 5d pop rbp > └ 0x0012cfed c3 ret > > This method gives us a perfect uniform distribution when RAND_MAX is a > power of 2, at the cost of executing rand() more times than required > (70/(62/256) ≈ 289 on average). This other method has a slightly not > uniform distribution, but only runs rand() 70 times: > > static void generate_boundary(Dstr *boundary) > { > /* Extracted from RFC 2046, section 5.1.1. */ > static const char set[] = "abcdefghijklmnopqrstuvwxyz" > "ABCDEFGHIJKLMNOPQRSTUVWXYZ" > "0123456789"; > static const int n = strlen(set); > > for (int i = 0; i < 70; i++) { > int c = (unsigned char) set[rand() % n]; > dStr_append_c(boundary, c); > } > } > > Which is probably okay for this case. > > Notice I haven't tested any of these methods yet, I'll need to add a > form upload test case to be able to see them in action (or a unit test). > >>> If we only use alphanumeric characters, we can just use isalnum() right? >> >> According to POSIX.1-2017 [2], isalnum(3) depends on the current locale >> configured by the system. For example, characters such as Ä or ú could >> return non-zero. > > Oh right, for some reason I was thinking this was the other way around, > and isalnum only worked with ASCII. I think we should review the other > uses of isalnum and friends as I think there may be used under similar > assumptions. > >> To avoid this, there are two possible solutions: >> >> 1. Use isalnum_l(3) to specify a locale_t object corresponding to the >> "POSIX" locale (equivalent to "C" [3]), which must be previously >> allocated by the newlocale(3) function [3] and released by the >> freelocal(3) function [4]. A minimalist example is shown below: >> >> locale_t l = newlocale(LC_CTYPE, "POSIX", NULL); >> >> for (unsigned char i = 0; i < 255; i++) >> printf("hhu=%hhu, c=%c, isalnum=%d\n", i, i, >> isalnum_l(i, l)); >> >> freelocale(l); >> >> 2. Define a known subset from the portable character set defined by >> POSIX.1-2017 [5] and use strspn(3), as already suggested by the patch. >> IMHO this approach is better because: >> - It does not deal with locales, so developers not familiar with them >> would understand the code better. >> - It is also portable outside a POSIX environment (not sure if this a >> requirement, though). >> - It does not require dynamic allication via newlocale(3). >> - It is the only possible option if non-alnum characters, such as ':' >> or '/', are appended to the boundary string. > > Then, I think using our own set may be the most readable solution here, > which also allows the quick module method. > > Best, > Rodrigo > _______________________________________________ > Dillo-dev mailing list -- dillo-dev@mailman3.com > To unsubscribe send an email to dillo-dev-le...@mailman3.com
OpenPGP_0x84FF3612A9BF43F2.asc
Description: OpenPGP public key
OpenPGP_signature.asc
Description: OpenPGP digital signature
_______________________________________________ Dillo-dev mailing list -- dillo-dev@mailman3.com To unsubscribe send an email to dillo-dev-le...@mailman3.com