[Dillo-dev] Re: Issues with HTTP multipart/form-data file upload

Xavier Del Campo Romero Fri, 30 Aug 2024 06:08:07 -0700

Hi Rodrigo,

> Dillo has a mechanism to read chunks of data from different sources as they 
> are arriving and pass them to the next stage for processing. However, AFAIK 
> it always reads a chunk and appends it to a large buffer. It doesn't free the 
> processed part until it is done with the whole thing.
> 
> This would require a change in the way Dillo processes data, but I think it 
> would be required for large files. There are more details in the
> devdoc/CCCwork.txt file and in src/chain.c if you want to take a closer look.
> 
> As I'm planning to change the design of the CCC, I think I can take this into 
> account too so it would be doable. I'll add it to the list of shortcomings of 
> the current design.


Thank you. I am still unfamiliar with that part of Dillo, so please let
me know about any progress.

> Okay, I'll focus on the boundary patch first, which is the easiest to merge 
> and then I'll take a closer look at the others.
>
> Yeah, I would assume a lot of implementations are broken, so we want to try 
> to minimize the chances we run into problems.

Limiting ourselves to a-z, A-Z and 0-9 would still account for 62 out of
the 75 possible characters, so roughly 83% of the set. I think that
removing the quoting in favour of the limited set would reduce the risk
of hitting broken implementations, yet still provide a good amount of
randomness.
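
As a rough sketch of the idea (the names make_boundary, boundary_chars
and BOUNDARY_LEN are hypothetical and not taken from the Dillo sources
or from the patch), a 70-character boundary restricted to the 62
alphanumeric characters could be generated like this. rand(3) is only
used to keep the example short and is an assumption, not a
recommendation:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names for illustration only. */
#define BOUNDARY_LEN 70

/* The limited set: 26 + 26 + 10 = 62 characters. */
static const char boundary_chars[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";

/*
 * Fills buf with len characters taken from boundary_chars and
 * null-terminates it. buf must hold at least len + 1 bytes.
 */
static void make_boundary(char *buf, size_t len)
{
    /* Number of usable characters, without the terminating null byte. */
    const size_t n = sizeof boundary_chars - 1;

    for (size_t i = 0; i < len; i++) {
        /* rand() has a slight modulo bias and is not a CSPRNG. */
        buf[i] = boundary_chars[(size_t)rand() % n];
    }

    buf[len] = '\0';
}
```

A caller would declare char b[BOUNDARY_LEN + 1] and call
make_boundary(b, BOUNDARY_LEN). Since none of these characters need
quoting, the result could be placed in the Content-Type header as-is.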

> Check sizeof " ": https://godbolt.org/z/7Tso8ooYz

Interestingly, the " " character on your last email is not really a
<space> (<U0020>):

$ printf "%s" " " | hd
00000000  e2 80 88                                          |...|
00000003

Compared to an ASCII whitespace:

$ printf "%s" " " | hd
00000000  20                                                | |
00000001

Both Godbolt and my editor also flag that multi-byte character with a
yellow rectangle around it because it would be highly confusing
otherwise. For example:

printf("len=%zu\n", strlen(" "));

Confusingly, this prints "len=3".

I am not sure whether this was an intentional modification from your
side. My patch is adding a <space> as defined by POSIX.1-2017 [1], so
that sizeof " " would always return 2. Was it your intention to flag
this potential confusion?

Also, there was no strict reason to use sizeof " ". Any other
single-byte character would do, e.g. sizeof "x", sizeof "A", etc.
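
To make the difference explicit, here is a small sketch that spells the
multi-byte character with hex escapes (the bytes e2 80 88 from the hd
output above, i.e. U+2008 in UTF-8) so it cannot be mistaken for a
<space>. WIDE_SPACE and wide_space_len are hypothetical names, and
static_assert assumes a C11 compiler:

```c
#include <assert.h>
#include <string.h>

/* U+2008 in UTF-8, written with escapes so the three bytes stay visible. */
#define WIDE_SPACE "\xe2\x80\x88"

/* One byte for the ASCII <space> plus the terminating null byte. */
static_assert(sizeof " " == 2, "ASCII space plus null byte");

/* Any other single-byte character gives the same size. */
static_assert(sizeof "x" == 2, "single-byte character plus null byte");

/* Three UTF-8 bytes plus the terminating null byte. */
static_assert(sizeof WIDE_SPACE == 4, "three bytes plus null byte");

/* strlen(3) counts bytes, not characters, and excludes the null byte. */
static size_t wide_space_len(void)
{
    return strlen(WIDE_SPACE);
}
```

The static assertions are checked at compile time, so a look-alike
character sneaking into the literal would break the build instead of
silently changing the computed size.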

> You can also use dStr_append_c() to only append one character, so you only 
> need a single character. 

That would be an unnecessary use of the heap, because the size is static.

> If we only use alphanumeric characters, we can just use isalnum() right? 

According to POSIX.1-2017 [2], isalnum(3) depends on the current locale
configured by the system. For example, characters such as Ä or ú could
return non-zero. To avoid this, there are two possible solutions:

1. Use isalnum_l(3) to specify a locale_t object corresponding to the
"POSIX" locale (equivalent to "C" [3]), which must be previously
allocated by the newlocale(3) function [3] and released by the
freelocale(3) function [4]. A minimalist example is shown below:

        /* newlocale(3) takes a category mask (LC_CTYPE_MASK), not LC_CTYPE. */
        locale_t l = newlocale(LC_CTYPE_MASK, "POSIX", (locale_t)0);

        for (unsigned char i = 0; i < 255; i++)
                printf("hhu=%hhu, c=%c, isalnum=%d\n", i, i,
                       isalnum_l(i, l));

        freelocale(l);

2. Define a known subset from the portable character set defined by
POSIX.1-2017 [5] and use strspn(3), as already suggested by the patch.
IMHO this approach is better because:
        - It does not deal with locales, so developers not familiar with
          them would understand the code better.
        - It is also portable outside a POSIX environment (not sure if
          this is a requirement, though).
        - It does not require dynamic allocation via newlocale(3).
        - It is the only possible option if non-alnum characters, such as
          ':' or '/', are appended to the boundary string.
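
For reference, the second approach can be sketched as follows. This is
only an illustration and not the code from the patch: allowed_chars and
boundary_is_valid are hypothetical names.

```c
#include <stdbool.h>
#include <string.h>

/* Known subset of the portable character set; extend it if needed. */
static const char allowed_chars[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";

/*
 * Returns true if s is non-empty and consists only of bytes listed in
 * allowed_chars. strspn(3) returns the length of the leading segment
 * made of allowed bytes, so the string is valid when that segment
 * reaches the terminating null byte.
 */
static bool boundary_is_valid(const char *s)
{
    return *s != '\0' && s[strspn(s, allowed_chars)] == '\0';
}
```

Because the comparison is done byte by byte against a fixed string, the
result does not change with the locale: boundary_is_valid("abc123")
holds, while a string containing a space or a UTF-8 encoded 'é' is
rejected. Adding ':' or '/' later would only require extending
allowed_chars.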

[1]:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/isalnum.html
[3]:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/newlocale.html
[4]:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/freelocale.html
[5]:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html

> I meant when is the next KoVoꓘ concert :-) 

No gigs ahead, but I will keep you informed. :)

Best regards,

Xavi

On 28/8/24 22:47, Rodrigo Arias wrote:
> Hi Xavier,
> 
> On Wed, Aug 28, 2024 at 01:04:04AM +0200, Xavier Del Campo Romero wrote:
>> Hi Rodrigo,
>>
>>> Glad to read that you also consider Dillo for slcl, and thanks for
>>> preparing the patches :-)
>>
>> Thank you! I want slcl to be useful to anyone, including users who care
>> about minimalist software like Dillo. The web is already too crowded
>> with bloated "webapps" and other terrible things. :)
> 
> Agreed!
> 
>>> Sounds good, not sure how complicated it would be to do this.
>>
>> I still need to investigate this further, but I assume this would
>> require Dillo to at least implement a sink callback.
>>
>> In other words, the component responsible for transmitting the data
>> (probably src/IO/IO.c) should trigger a user-defined callback with an
>> arbitrarily-sized buffer (typically, of BUFSIZ bytes, as defined by
>> stdio.h) that must be filled with file data. Then, the user-defined
>> callback can fill from zero up to BUFSIZ bytes, which are eventually
>> transmitted to the server.
> 
> Dillo has a mechanism to read chunks of data from different sources as
> they are arriving and pass them to the next stage for processing.
> However, AFAIK it always reads a chunk and appends it to a large buffer.
> It doesn't free the processed part until it is done with the whole thing.
> 
> This would require a change in the way Dillo processes data, but I think
> it would be required for large files. There are more details in the
> devdoc/CCCwork.txt file and in src/chain.c if you want to take a closer
> look.
> 
> As I'm planning to change the design of the CCC, I think I can take this
> into account too so it would be doable. I'll add it to the list of
> shortcomings of the current design.
> 
>> That said, I am still not sure how much actual effort this would take.
>> But I am glad to receive positive feedback so far - I will then continue
>> to find a solution.
>>
>>> However, being able to upload multiple files at the same time sounds
>>> reasonable, so feel free to try on your own in the meanwhile.
>>
>> Uploading multiple files at once seems doable - the patches I sent on my
>> previous email are probably already doing most of the required work.
>> Again, the trickiest task is to send data on-the-fly for each selected
>> file.
> 
> Okay, I'll focus on the boundary patch first, which is the easiest to
> merge and then I'll take a closer look at the others.
> 
>>
>>> Shouldn't it be 68 then?
>>
>> I understand the opposite: the boundary string with the two leading
>> dashes ("--") included can be up to 72 bytes long, and 74 bytes long for
>> the ending boundary (which includes two more dashes after the boundary
>> string). This is confirmed by reading the BNF defined by RFC 2046 (some
>> bits omitted for simplicity), section 5.1.1 [1]:
>>
>>> boundary := 0*69<bchars> bcharsnospace
>>> bchars := bcharsnospace / " "
>>> bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
>>>                       "+" / "_" / "," / "-" / "." /
>>>                       "/" / ":" / "=" / "?"
>>> dash-boundary := "--" boundary
>>>                       ; boundary taken from the value of
>>>                       ; boundary parameter of the
>>>                       ; Content-Type field
>>> multipart-body := [preamble CRLF]
>>>                        dash-boundary transport-padding CRLF
>>>                        body-part *encapsulation
>>>                        close-delimiter transport-padding
>>>                        [CRLF epilogue]
>>> delimiter := CRLF dash-boundary
>>> close-delimiter := delimiter "--"
> 
> Oh right! I see that we are already using 70 characters anyway.
> 
>> Note: even if the specification tells receivers to handle transport
>> padding, for the time being I am assuming "transport-padding" as zero
>> length since composers must not generate non-zero length transport
>> padding. I am still not sure where transport padding would apply,
>> anyway. Probably outside web browsers?
>>
>>> I would leave out all the symbols to avoid quoting and only use A-Z
>>> a-z and 0-9.
>>
>> Interestingly, Dillo would always quote boundary strings [2], even if
>> only using A-Z, a-z and 0-9. In fact, this is one of the wrong
>> assumptions I spotted when testing slcl against Dillo.
> 
> Yeah, I would assume a lot of implementations are broken, so we want to
> try to minimize the chances we run into problems.
> 
> Apart from slcl we should also test this with some sites and see if they
> continue to work okay.
> 
> This will also increase the fingerprinting information to distinguish
> Dillo among other browsers, but I think it is not more information than
> what is already leaked by the user agent.
> 
>>> Which, if I computed it correctly, is still too small to worry about.
>>
>> Not only it is too small of a chance: if we really wanted to do "the
>> right thing" and make Dillo absolutely sure the boundary string is not
>> contained within the selected files, this would imply a noticeable
>> performance impact when dealing with large files, most likely for a
>> near-zero benefit.
>>
>> I have not inspected their source code yet (and I do not want to), but I
>> understand both Gecko and Chromium are also making that assumption,
>> because otherwise it would take them a lot of CPU time to upload large
>> files.
> 
> But then they would be making such an assumption with a "much larger"
> probability it hits the file.
> 
> Skipping it with 70 characters is safe for one file, but also probably
> safe for all files ever uploaded with Dillo.
> 
> Maybe curl or other small codebases are easier to read, but not really
> needed.
> 
>>
>>> Why sizeof " " instead of just 2?
>>
>> Because, to my eyes, sizeof " " has more meaningful semantics, compared
>> to a magic integer constant such as 2. However, for this simple
>> scenario, I would still consider both acceptable.
> 
> Check sizeof " ": https://godbolt.org/z/7Tso8ooYz
> 
> You can also use dStr_append_c() to only append one character, so you
> only need a single character.
> 
> If we only use alphanumeric characters, we can just use isalnum() right?
> 
>> I can replace it with 2 if you find the other construct unacceptable.
>>
>>> PS: When are you playing?
>>
>> Sorry, I did not understand your last sentence. Could you please give a
>> bit more context? :)
> 
> I meant when is the next KoVoꓘ concert :-)
> 
> Best,
> Rodrigo.
> _______________________________________________
> Dillo-dev mailing list -- dillo-dev@mailman3.com
> To unsubscribe send an email to dillo-dev-le...@mailman3.com
