On Fri, 7 Jul 2023 19:17:59 GMT, Andy Goryachev <[email protected]> wrote:
>> modules/javafx.graphics/src/main/java/com/sun/javafx/util/DataURI.java line
>> 115:
>>
>>> 113: nameValuePairs,
>>> 114: base64,
>>> 115: base64 ? Base64.getDecoder().decode(data) :
>>> decodePercentEncoding(data));
>>
>> I wonder if this is all necessary. The data is supposed to be url-encoded,
>> so it's essentially ASCII, no?
>>
>> passing default charset to getBytes() is not right, it probably should be
>>
>> URLDecoder.decode(data.replace("+", "%2B"),
>> charset).getBytes(StandardCharsets.US_ASCII));
>>
>> or am I missing something?
>
> From https://datatracker.ietf.org/doc/html/rfc3986#page-11
>
>
> Therefore, the
>
>
>
>
>
> Berners-Lee, et al. Standards Track [Page 11]
>
> [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986)
> URI Generic Syntax January 2005
>
>
> integer values used by the ABNF must be mapped back to their
> corresponding characters via US-ASCII in order to complete the syntax
> rules.
> I wonder if this is all necessary. The data is supposed to be url-encoded, so
> it's essentially ASCII, no?
>
> passing default charset to getBytes() is not right, it probably should be
>
> URLDecoder.decode(data.replace("+", "%2B"),
> charset).getBytes(StandardCharsets.US_ASCII));
>
> or am I missing something?
The payload of a data URI is just a sequence of bytes, not characters. Only
when the numeric value of a byte, assuming ASCII encoding, is a *safe URL
character*, it is left as-is; otherwise percent-encoding is used to encode the
byte value. The [specification](https://datatracker.ietf.org/doc/html/rfc2397)
points out:
Without ";base64", the data (as a sequence of octets) is represented using
ASCII encoding for octets inside the range of safe URL characters and using
the standard %xx hex encoding of URLs for octets outside that range.
Decoding the payload back to a byte array is done by simply converting each
assumed ASCII character to its numeric value, and decoding percent-encoded
bytes to their hex value. Note that the assumed ASCII encoding only refers to
the URL, but not to the payload. The payload is not a string, and it doesn't
contain characters; it's a sequence of bytes.
`URLDecoder` is not a general-purpose class to decode a percent-encoded
sequence of bytes. It's specifically meant to take a HTML forms string and
decode it into a string with some defined charset, using additional rules that
don't generally apply to percent-encoded byte sequences. For example, a space
character is encoded as `+` (that's where the kind-of-hacky `data.replace("+",
"%2B")` comes from).
Using `URLDecoder` kind of works (if we use a sufficiently rich charset for
both `URLDecoder.decode` and `String.getBytes`), but only by accident. While it
accepts almost any percent-encoded data, the Javadoc for `URLDecoder` says:
There are two possible ways in which this decoder could deal with illegal
strings.
It could either leave illegal characters alone or it could throw an
IllegalArgumentException.
Which approach the decoder takes is left to the implementation.
-------------
PR Review Comment: https://git.openjdk.org/jfx/pull/1165#discussion_r1256426290