Ruiqi Dong created CODEC-339:
--------------------------------

             Summary: URLCodec.encodeUrl(BitSet, ...) accepts a BitSet that 
conflicts with decodeUrl(...)
                 Key: CODEC-339
                 URL: https://issues.apache.org/jira/browse/CODEC-339
             Project: Commons Codec
          Issue Type: Bug
            Reporter: Ruiqi Dong


*Summary*
URLCodec.encodeUrl(BitSet, byte[]) accepts a caller-provided safe-character 
set, but it does not document or reject the URL escape character '%' as a 
forbidden entry. If the supplied BitSet marks '%' as safe, the encoder emits a 
literal '%'. URLCodec.decodeUrl(byte[]) always interprets '%' as the start of 
an escape sequence, so the API accepts a configuration that can produce output 
the paired decoder cannot consume.
 
*Affected code*
File: src/main/java/org/apache/commons/codec/net/URLCodec.java
{code:java}
if (b == '+') {
    buffer.write(' ');
} else if (b == ESCAPE_CHAR) {
    ...
} {code}
{code:java}
if (urlsafe.get(b)) {
    if (b == ' ') {
        b = '+';
    }
    buffer.write(b);
} else {
    buffer.write(ESCAPE_CHAR);
    ...
}{code}
The decoder always reserves '%', but the encoder allows a caller-provided 
BitSet to treat '%' as a literal output byte.
 
*Reproducer*
Add the following test to 
src/test/java/org/apache/commons/codec/net/URLCodecTest.java:
{code:java}
@Test
void testEncodeUrlMustEscapeLiteralPercentEvenIfMarkedSafe() throws Exception {
    final BitSet safe = new BitSet();
    safe.set('%');
    final byte[] encoded = URLCodec.encodeUrl(safe, 
"%".getBytes(StandardCharsets.US_ASCII));
    assertArrayEquals("%25".getBytes(StandardCharsets.US_ASCII), encoded,
            "The URL escape character must not be emitted literally");
    assertArrayEquals("%".getBytes(StandardCharsets.US_ASCII), 
URLCodec.decodeUrl(encoded),
            "Round-trip should preserve a literal percent sign");
} {code}
Run:
{code:java}
mvn -q 
-Dtest=org.apache.commons.codec.net.URLCodecTest#testEncodeUrlMustEscapeLiteralPercentEvenIfMarkedSafe
 test {code}
Observed behavior:
The accepted configuration produces the encoded form %, and decoding that 
output throws:
{code:java}
org.apache.commons.codec.DecoderException: Invalid URL encoding: {code}
Expected behavior:
The public API should choose one consistent behavior and make it explicit: * 
document that '%' must never be marked safe in the caller-provided BitSet
 * reject such a BitSet up front, for example with IllegalArgumentException
 * or keep '%' reserved regardless of the `BitSet` and always emit %25



This is a public API contract gap and a semantic mismatch between the paired 
encode/decode entry points: * encodeUrl(...) can emit a literal '%'
 * decodeUrl(...) always treats '%' as control syntax
The API exposes a configurable safe-character set but does not tell callers 
that '%' is a reserved value, and it does not reject that configuration either. 
As a result, a supported public entry point accepts input that can generate 
undecodable output.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to