[ 
https://issues.apache.org/jira/browse/CODEC-339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruiqi Dong updated CODEC-339:
-----------------------------
    Description: 
*Summary*
URLCodec.encodeUrl(BitSet, byte[]) accepts a caller-provided safe-character 
set, but it does not document or reject the URL escape character '%' as a 
forbidden entry. If the supplied BitSet marks '%' as safe, the encoder emits a 
literal '%'. URLCodec.decodeUrl(byte[]) always interprets '%' as the start of 
an escape sequence, so the API accepts a configuration that can produce output 
the paired decoder cannot consume.
 
*Affected code*
File: src/main/java/org/apache/commons/codec/net/URLCodec.java
{code:java}
if (b == '+') {
    buffer.write(' ');
} else if (b == ESCAPE_CHAR) {
    ...
} {code}
{code:java}
if (urlsafe.get(b)) {
    if (b == ' ') {
        b = '+';
    }
    buffer.write(b);
} else {
    buffer.write(ESCAPE_CHAR);
    ...
}{code}
The decoder always reserves '%', but the encoder allows a caller-provided 
BitSet to treat '%' as a literal output byte.
 
*Reproducer*
Add the following test to 
src/test/java/org/apache/commons/codec/net/URLCodecTest.java:
{code:java}
@Test
void testEncodeUrlMustEscapeLiteralPercentEvenIfMarkedSafe() throws Exception {
    final BitSet safe = new BitSet();
    safe.set('%');
    final byte[] encoded = URLCodec.encodeUrl(safe, 
"%".getBytes(StandardCharsets.US_ASCII));
    assertArrayEquals("%25".getBytes(StandardCharsets.US_ASCII), encoded,
            "The URL escape character must not be emitted literally");
    assertArrayEquals("%".getBytes(StandardCharsets.US_ASCII), 
URLCodec.decodeUrl(encoded),
            "Round-trip should preserve a literal percent sign");
} {code}
Run:
{code:java}
mvn -q 
-Dtest=org.apache.commons.codec.net.URLCodecTest#testEncodeUrlMustEscapeLiteralPercentEvenIfMarkedSafe
 test {code}
Observed behavior:
The accepted configuration produces the encoded form %, and decoding that 
output throws:
{code:java}
org.apache.commons.codec.DecoderException: Invalid URL encoding: {code}
Expected behavior:
The public API should choose one consistent behavior and make it explicit: * 
document that '%' must never be marked safe in the caller-provided BitSet
 * reject such a BitSet up front, for example with IllegalArgumentException
 * or keep '%' reserved regardless of the `BitSet` and always emit %25

This is a public API contract gap and a semantic mismatch between the paired 
encode/decode entry points:
 * encodeUrl(...) can emit a literal '%'
 * decodeUrl(...) always treats '%' as control syntax

The API exposes a configurable safe-character set but does not tell callers 
that '%' is a reserved value, and it does not reject that configuration either. 
As a result, a supported public entry point accepts input that can generate 
undecodable output.

  was:
*Summary*
URLCodec.encodeUrl(BitSet, byte[]) accepts a caller-provided safe-character 
set, but it does not document or reject the URL escape character '%' as a 
forbidden entry. If the supplied BitSet marks '%' as safe, the encoder emits a 
literal '%'. URLCodec.decodeUrl(byte[]) always interprets '%' as the start of 
an escape sequence, so the API accepts a configuration that can produce output 
the paired decoder cannot consume.
 
*Affected code*
File: src/main/java/org/apache/commons/codec/net/URLCodec.java
{code:java}
if (b == '+') {
    buffer.write(' ');
} else if (b == ESCAPE_CHAR) {
    ...
} {code}
{code:java}
if (urlsafe.get(b)) {
    if (b == ' ') {
        b = '+';
    }
    buffer.write(b);
} else {
    buffer.write(ESCAPE_CHAR);
    ...
}{code}
The decoder always reserves '%', but the encoder allows a caller-provided 
BitSet to treat '%' as a literal output byte.
 
*Reproducer*
Add the following test to 
src/test/java/org/apache/commons/codec/net/URLCodecTest.java:
{code:java}
@Test
void testEncodeUrlMustEscapeLiteralPercentEvenIfMarkedSafe() throws Exception {
    final BitSet safe = new BitSet();
    safe.set('%');
    final byte[] encoded = URLCodec.encodeUrl(safe, 
"%".getBytes(StandardCharsets.US_ASCII));
    assertArrayEquals("%25".getBytes(StandardCharsets.US_ASCII), encoded,
            "The URL escape character must not be emitted literally");
    assertArrayEquals("%".getBytes(StandardCharsets.US_ASCII), 
URLCodec.decodeUrl(encoded),
            "Round-trip should preserve a literal percent sign");
} {code}
Run:
{code:java}
mvn -q 
-Dtest=org.apache.commons.codec.net.URLCodecTest#testEncodeUrlMustEscapeLiteralPercentEvenIfMarkedSafe
 test {code}
Observed behavior:
The accepted configuration produces the encoded form %, and decoding that 
output throws:
{code:java}
org.apache.commons.codec.DecoderException: Invalid URL encoding: {code}
Expected behavior:
The public API should choose one consistent behavior and make it explicit: * 
document that '%' must never be marked safe in the caller-provided BitSet
 * reject such a BitSet up front, for example with IllegalArgumentException
 * or keep '%' reserved regardless of the `BitSet` and always emit %25



This is a public API contract gap and a semantic mismatch between the paired 
encode/decode entry points: * encodeUrl(...) can emit a literal '%'
 * decodeUrl(...) always treats '%' as control syntax
The API exposes a configurable safe-character set but does not tell callers 
that '%' is a reserved value, and it does not reject that configuration either. 
As a result, a supported public entry point accepts input that can generate 
undecodable output.


> URLCodec.encodeUrl(BitSet, ...) accepts a BitSet that conflicts with 
> decodeUrl(...)
> -----------------------------------------------------------------------------------
>
>                 Key: CODEC-339
>                 URL: https://issues.apache.org/jira/browse/CODEC-339
>             Project: Commons Codec
>          Issue Type: Bug
>            Reporter: Ruiqi Dong
>            Priority: Minor
>
> *Summary*
> URLCodec.encodeUrl(BitSet, byte[]) accepts a caller-provided safe-character 
> set, but it does not document or reject the URL escape character '%' as a 
> forbidden entry. If the supplied BitSet marks '%' as safe, the encoder emits 
> a literal '%'. URLCodec.decodeUrl(byte[]) always interprets '%' as the start 
> of an escape sequence, so the API accepts a configuration that can produce 
> output the paired decoder cannot consume.
>  
> *Affected code*
> File: src/main/java/org/apache/commons/codec/net/URLCodec.java
> {code:java}
> if (b == '+') {
>     buffer.write(' ');
> } else if (b == ESCAPE_CHAR) {
>     ...
> } {code}
> {code:java}
> if (urlsafe.get(b)) {
>     if (b == ' ') {
>         b = '+';
>     }
>     buffer.write(b);
> } else {
>     buffer.write(ESCAPE_CHAR);
>     ...
> }{code}
> The decoder always reserves '%', but the encoder allows a caller-provided 
> BitSet to treat '%' as a literal output byte.
>  
> *Reproducer*
> Add the following test to 
> src/test/java/org/apache/commons/codec/net/URLCodecTest.java:
> {code:java}
> @Test
> void testEncodeUrlMustEscapeLiteralPercentEvenIfMarkedSafe() throws Exception 
> {
>     final BitSet safe = new BitSet();
>     safe.set('%');
>     final byte[] encoded = URLCodec.encodeUrl(safe, 
> "%".getBytes(StandardCharsets.US_ASCII));
>     assertArrayEquals("%25".getBytes(StandardCharsets.US_ASCII), encoded,
>             "The URL escape character must not be emitted literally");
>     assertArrayEquals("%".getBytes(StandardCharsets.US_ASCII), 
> URLCodec.decodeUrl(encoded),
>             "Round-trip should preserve a literal percent sign");
> } {code}
> Run:
> {code:java}
> mvn -q 
> -Dtest=org.apache.commons.codec.net.URLCodecTest#testEncodeUrlMustEscapeLiteralPercentEvenIfMarkedSafe
>  test {code}
> Observed behavior:
> The accepted configuration produces the encoded form %, and decoding that 
> output throws:
> {code:java}
> org.apache.commons.codec.DecoderException: Invalid URL encoding: {code}
> Expected behavior:
> The public API should choose one consistent behavior and make it explicit: * 
> document that '%' must never be marked safe in the caller-provided BitSet
>  * reject such a BitSet up front, for example with IllegalArgumentException
>  * or keep '%' reserved regardless of the `BitSet` and always emit %25
> This is a public API contract gap and a semantic mismatch between the paired 
> encode/decode entry points:
>  * encodeUrl(...) can emit a literal '%'
>  * decodeUrl(...) always treats '%' as control syntax
> The API exposes a configurable safe-character set but does not tell callers 
> that '%' is a reserved value, and it does not reject that configuration 
> either. As a result, a supported public entry point accepts input that can 
> generate undecodable output.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to