Thanks again Gustaf,

I can see the W3C spec you reference seems quite unequivocal in saying an
error message should be sent back when decoding invalid UTF-8 form data.

But I was curious why other implementations appear to use the Unicode
replacement character (U+FFFD) instead, and found a bit of discussion in
the Unicode Standard itself [1] & [2].

[1] specifically refers to the WHATWG (W3C) spec for encoding/decoding [3],
which handles an "error" condition when decoding UTF-8 according to one of
two possible error modes, namely:

   - fatal - "return the error"
   - replacement - "Push U+FFFD (�) to output."

This aligns with the behaviour of, say, Python's bytes.decode() [4], where
the default is to raise an error on decoding errors ("strict" error
handling), but optionally you can specify "replace" error handling, which
substitutes the U+FFFD character instead. I can see this working in cases
where we're told the data should be UTF-8, or where we're assuming by
default that it's UTF-8.
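
To make the two modes concrete, here is a minimal Python sketch. The sample
bytes are borrowed from the p2 value in the curl examples quoted below; the
variable names are purely illustrative.

```python
# 0xE6 on its own is not valid UTF-8 (it is "æ" in ISO-8859-1).
raw = b"a\xe6b"

# Default "strict" error handling: decoding raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" error handling: the invalid byte becomes U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# With the correct charset declared, the same bytes decode cleanly.
latin1 = raw.decode("iso-8859-1")

print(strict_failed, repr(replaced), repr(latin1))
```

The last line is the case Gustaf describes, where the charset is specified
explicitly rather than assumed to be UTF-8.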

But I'm not sure how much work this would be to implement, or whether
others would see it as worthwhile.

As it stands, we have legacy applications which POST data to us and which
regularly (and, by now, expectedly) send invalid characters despite best
efforts to fix this.
I guess we would redirect the POSTs to another non-NaviServer system,
sanitise the data there, then send it on to NaviServer, but it would be
nice to be able to deal with it within NaviServer itself.
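
For what it's worth, the external sanitising step could look roughly like
the following. This is only a sketch, assuming the body is
x-www-form-urlencoded and nominally UTF-8; sanitise_form_body is a
hypothetical helper name, not anything that exists in NaviServer.

```python
from urllib.parse import parse_qsl, urlencode


def sanitise_form_body(body: str) -> str:
    """Re-encode a urlencoded form body, replacing invalid UTF-8 with U+FFFD.

    Hypothetical helper for illustration only.
    """
    # Percent-decode each field as UTF-8; invalid sequences become U+FFFD.
    pairs = parse_qsl(body, keep_blank_values=True,
                      encoding="utf-8", errors="replace")
    # Percent-encode again, so the result is always valid UTF-8.
    return urlencode(pairs)


print(sanitise_form_body("p1=a%C5%93b&p2=a%E6b"))
```

The valid p1 value passes through unchanged, while the stray %E6 in p2 comes
out as %EF%BF%BD, the UTF-8 encoding of U+FFFD.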

[1] https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf (Section 3.9
"U+FFFD Substitution of Maximal Subparts")
[2] https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf (Section 5.22
"U+FFFD Substitution in Conversion")
[3] https://encoding.spec.whatwg.org/#decoder
[4] https://docs.python.org/3/library/stdtypes.html#bytes.decode

On Mon, 2 May 2022 at 13:30, Gustaf Neumann <neum...@wu.ac.at> wrote:

> Dear David and all,
>
> I looked into this issue, and I do not like the current situation either.
> In the current snapshot, a GET request with invalid coded
> query variables is rejected, while the POST request leads just
> to the warning, and the invalid entry is omitted.
>
> W3C [1] says in the reference for Multilingual form encoding:
> > If non-UTF-8 data is received, an error message should be sent back.
>
> This means, that the only defensible logic is to reject in both cases
> the request as invalid. One can certainly send single-byte funny character
> data in URLs, which is invalid UTF8 (e.g. "%9C" or "%E6" etc.),
> but for these requests, the charset has to be specified, either
> via content type, or via the default URL encoding in the NaviServer
> configuration... see example (2)  below.
>
> As mentioned earlier, there are increasingly many attacks with invalid
> UTF-8 data (also by vulnerability scanners), so we need to be strict here.
>
> I will try to address the outstanding issues ASAP and provide then
> another RC.
>
> All the best
>
> -gn
>
> [1] https://www.w3.org/International/questions/qa-forms-utf-8
>
>
>  # POST request with already encoded form data (x-www-form-urlencoded)
>  $ curl -X POST -d "p1=a%C5%93Cb&p2=a%E6b" localhost:8100/upload.tcl
>
>  # POST request with already encoded form data, but proper encoding
>  $ curl -X POST -H "Content-Type: application/x-www-form-urlencoded; charset=iso-8859-1" -d "p2=a%E6b" localhost:8100/upload.tcl
>
>  # POST + x-www-form-urlencoded, but let curl do the encoding
>  $ curl -X POST -d "p1=aœb" -d $(echo -e 'p2=a\xE6b') localhost:8100/upload.tcl
>
>  # POST + multipart/form-data, let curl do the encoding
>  $ curl -X POST -F "p1=aœb" -F $(echo -e 'p2=a\xE6b') localhost:8100/upload.tcl
>
>  # GET request with already encoded query variables
>  $ curl -X GET  "localhost:8100/upload.tcl?p1=a%C5%93Cb&p2=a%E6b"
>
>
>
_______________________________________________
naviserver-devel mailing list
naviserver-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/naviserver-devel
