Hi Christian,

Thanks for raising this topic! I agree this is a significant source of
compatibility problems between Java and non-Java clients.

On the encoding side (client side), the fix is straightforward: encode
spaces as per RFC 3986, that is, encode a space as `%20`. Since all
servers, old and new, would correctly decode `%20` as a space, this
part of the fix is benign.
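
To make the encoding side concrete, here is a minimal sketch of what such a method could look like. The class and method names are hypothetical (nothing like this exists in RESTUtil today), and it assumes Java 10+ for the `Charset` overload of `URLEncoder.encode`:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PathSegmentEncoding {

  // Hypothetical helper, not an existing RESTUtil method.
  static String encodePathSegment(String value) {
    // URLEncoder emits "+" only for a space (a literal "+" becomes "%2B"),
    // so rewriting every "+" in its output to "%20" is unambiguous.
    return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
  }

  public static void main(String[] args) {
    System.out.println(encodePathSegment("my ns")); // my%20ns
    System.out.println(encodePathSegment("a+b"));   // a%2Bb
  }
}
```

The output percent-encodes more characters than RFC 3986 strictly requires for a path segment, but that is harmless: both form-style and percent-style decoders read `%20` and `%2B` the same way.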

On the decoding side (server side), however, things are more complicated.
A new server can't distinguish between:

  - "+" meaning space (from an old Java client)
  - "+" meaning literal "+" (from any RFC 3986-compliant client)

Interpreting "+" as a space breaks non-Java clients (as happens today),
but interpreting it as a literal "+" would break all Java clients that
haven't been updated.
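
The ambiguity can be shown in a few lines. This is only an illustration: `formDecode` mirrors what `java.net.URLDecoder` does, while `pathDecode` is a hypothetical stand-in for an RFC 3986-style decoder (it protects "+" before delegating to `URLDecoder`), not an existing RESTUtil method:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PlusAmbiguity {

  // Form-style decoding: "+" is read as a space.
  static String formDecode(String encoded) {
    return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
  }

  // Hypothetical RFC 3986-style decoding: "+" is protected so it stays literal.
  static String pathDecode(String encoded) {
    return URLDecoder.decode(encoded.replace("+", "%2B"), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String onTheWire = "my+ns";
    System.out.println(formDecode(onTheWire)); // my ns
    System.out.println(pathDecode(onTheWire)); // my+ns

    // "%20" is unambiguous: both decoders agree.
    System.out.println(formDecode("my%20ns")); // my ns
    System.out.println(pathDecode("my%20ns")); // my ns
  }
}
```

The same bytes on the wire yield two different identifiers, and the server has no way to tell which one the client meant.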

I therefore suggest a phased approach:

1) Release N: Introduce a new encoding method in RESTUtil and "wire it
up" immediately to ResourcePaths and RESTUtil.encodeNamespace. This
would immediately benefit non-Java servers.

2) Release N: For the decoding side, introduce a new decoding method
in RESTUtil, but leave it "unwired" for now. In particular,
RESTUtil.decodeNamespace should keep using the old decode method.
Java-based servers should also keep using the old method in order to
not disrupt existing Java clients.

3) Release N+M: A few releases later, change RESTUtil.decodeNamespace
to use the new decoding method (behavioral change). At this point,
each Java-based server should also adopt the new decoding method and
become RFC compliant. Servers are of course free to adopt the new
decoding method earlier if that's acceptable for them, e.g. by using a
feature flag.
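
As a rough sketch of how a server could gate the switch behind a flag: everything below is hypothetical (the class, the flag, and the method name are mine, not existing Iceberg APIs), and the "new" decoding is approximated by protecting "+" before calling `URLDecoder`:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class NamespaceLevelDecoder {

  // Hypothetical feature flag: false keeps today's form-style decoding,
  // true treats "+" as a literal character (RFC 3986 behavior).
  private final boolean rfc3986Decoding;

  public NamespaceLevelDecoder(boolean rfc3986Decoding) {
    this.rfc3986Decoding = rfc3986Decoding;
  }

  public String decodeLevel(String encoded) {
    // In the new mode, protect "+" so URLDecoder does not turn it into a space.
    String input = rfc3986Decoding ? encoded.replace("+", "%2B") : encoded;
    return URLDecoder.decode(input, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    NamespaceLevelDecoder legacy = new NamespaceLevelDecoder(false);
    NamespaceLevelDecoder compliant = new NamespaceLevelDecoder(true);

    System.out.println(legacy.decodeLevel("a+b"));      // a b
    System.out.println(compliant.decodeLevel("a+b"));   // a+b
    System.out.println(compliant.decodeLevel("a%20b")); // a b
  }
}
```

With the flag off, old Java clients keep working; flipping it on is the behavioral change described in step 3.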

As a side note, RESTUtil.decodeString and RESTUtil.decodeNamespace are
hard for modern Java servers to use, because the REST framework has
generally already decoded the path by the time application code sees
it. Apache Polaris, for instance, has to go through a convoluted
process just to be able to call these methods. We could take this
opportunity to also provide a better way of decoding namespaces.

Finally, the tests you pointed at clearly need to be revisited as they
are implicitly validating the wrong decoding behavior.

Thanks,
Alex


On Tue, Apr 14, 2026 at 7:58 AM Eduard Tudenhöfner
<[email protected]> wrote:
>
> Thanks for bringing this up Christian. I support fixing this in the Java 
> client, provided the fix is fully backward-compatible with older clients and 
> servers. We should probably add a separate method in RESTUtil to encode path 
> segments and be more explicit about when to use x-www-form-urlencoded vs RFC 
> 3986 path encoding.
>
>
>
> On Sat, Apr 11, 2026 at 10:13 PM Christian Thiel <[email protected]> 
> wrote:
>>
>> Dear all,
>>
>> I believe the Java Iceberg REST client encodes namespace and table 
>> identifiers slightly incorrectly when constructing request URLs. Path 
>> segments are built with `java.net.URLEncoder.encode(...)`, which implements 
>> `application/x-www-form-urlencoded` — not RFC 3986 path encoding. The 
>> visible symptom is that a space becomes `+` instead of `%20`, and a literal 
>> `+` becomes `%2B` (indistinguishable from an encoded space after 
>> form-decoding).
>>
>> Root cause: `RESTUtil.encodeString(String)` wraps `URLEncoder.encode`. It 
>> has two kinds of callers with incompatible requirements:
>>
>> 1. OAuth2 form bodies (RFC 6749) — current behavior is correct.
>> 2. URL path segments in `ResourcePaths` (table / view / metrics / plan / 
>> task) and per-level namespace encoding in `RESTUtil.encodeNamespace` — 
>> current behavior is wrong per RFC 3986.
>>
>> Non-Java engines get this right. DuckDB, for example, sends `%20` for a 
>> space in a namespace or table name, so a spec-compliant server that 
>> correctly percent-decodes path segments sees a different identifier 
>> depending on which client issued the request.
>>
>> We are already using the now-customizable separator (`\u001f`) to join 
>> multi-level namespaces in path segments, which is itself a deviation from a 
>> pure "one segment per level" RFC approach. That's fine as a deliberate 
>> choice, but I believe we should still respect RFC 3986 for encoding the 
>> level contents themselves.
>>
>> Impact:
>> - Any namespace or table identifier containing a space, `+`, or other 
>> characters where form-urlencoded and RFC 3986 path encoding disagree (I 
>> believe space is by far the most important one) is sent on the wire with 
>> the wrong encoding from the Java client.
>> - A server that correctly decodes path segments sees `my+ns` instead of `my 
>> ns` — leading to 404s, silent access of the wrong object, or catalog 
>> inconsistency if two identifiers collide after decoding (`"a b"` vs `"a+b"`).
>> - Cross-engine interop breaks: an object created by a non-Java client with a 
>> space in the name is not addressable from the Java client, and vice versa.
>> - At Lakekeeper we have for some time now prohibited creation of objects 
>> with `+` in their name and interpret `+` in path segments as space on read, 
>> as a pragmatic workaround. Creation is unambiguous because the identifier 
>> arrives in the request body, not the path, so we can reject it there. 
>> Read/update/drop paths are the ones where ambiguity bites. In other Catalogs 
>> some clients simply can't load or write to affected tables.
>> - The OAuth2 test in `TestRESTUtil` pins form-encoding behavior, and 
>> `TestResourcePaths` even asserts `"plan with spaces"` → `"plan+with+spaces"` 
>> in a path — so the current behavior is locked in by tests. No tests cover 
>> namespace/table identifiers containing spaces or `+`.
>>
>> Does anyone see a problem with fixing this in the Java client? I'd like to 
>> understand whether anyone is relying on the current encoding (servers that 
>> form-decode path segments, proxies, intermediate tooling) before opening an 
>> issue/PR. If it turns out there are too many compatibility concerns to fix 
>> it outright, I think we should at the very least document the current 
>> encoding behavior explicitly in the REST spec, so server implementers and 
>> other clients can interoperate deliberately. Related to that, we should also 
>> disallow affected identifiers from being routed through generic OpenAPI code 
>> generation for path parameters — a standards-compliant generated client will 
>> encode per RFC 3986, and silently round-tripping names through such a client 
>> against a form-decoding server permanently loses the distinction between 
>> space and `+` (and the original name with it).
>>
>> Thanks,
>> Christian
>>
>> References (permalinks on `main` @ `7e4aa89`):
>> - `RESTUtil.encodeString`: 
>> https://github.com/apache/iceberg/blob/7e4aa89d9900a52620afd1456152b63b47f2223b/core/src/main/java/org/apache/iceberg/rest/RESTUtil.java#L154-L157
>> - `RESTUtil.encodeNamespace` per-level encoding: 
>> https://github.com/apache/iceberg/blob/7e4aa89d9900a52620afd1456152b63b47f2223b/core/src/main/java/org/apache/iceberg/rest/RESTUtil.java#L288-L300
>> - `ResourcePaths` path-segment callers: 
>> https://github.com/apache/iceberg/blob/7e4aa89d9900a52620afd1456152b63b47f2223b/core/src/main/java/org/apache/iceberg/rest/ResourcePaths.java#L111
>> - `TestResourcePaths` pinning `+` for space in a path: 
>> https://github.com/apache/iceberg/blob/7e4aa89d9900a52620afd1456152b63b47f2223b/core/src/test/java/org/apache/iceberg/rest/TestResourcePaths.java#L321-L330
>> - `TestRESTUtil.testOAuth2URLEncoding`: 
>> https://github.com/apache/iceberg/blob/7e4aa89d9900a52620afd1456152b63b47f2223b/core/src/test/java/org/apache/iceberg/rest/TestRESTUtil.java#L143-L149
>>
