I've opened #10877 <https://github.com/apache/iceberg/pull/10877> a few
days ago to make the namespace separator configurable and let servers
communicate to clients which separator should be used. Worth mentioning
that this doesn't require any spec chance and it is backwards compatible
with older clients that send *%1F*.

I'm ok if we want to make the namespace separator also be controllable from
the client and send a *delim* query param, but I agree with Dmitri that the
server should still be able to tell a client which separator it prefers (as
indicated in #10877 <https://github.com/apache/iceberg/pull/10877>).

Sending the *delim* query param requires a non-breaking spec change, so
I've opened #10904 <https://github.com/apache/iceberg/pull/10904>.

After we vote on the spec change I'll add the required impl
<https://github.com/apache/iceberg/pull/10905> (which just goes on top of
#10877 <https://github.com/apache/iceberg/pull/10877>) to send the
*delim* query
param.

Eduard

On Thu, Aug 8, 2024 at 12:52 AM Dmitri Bourlatchkov
<dmitri.bourlatch...@dremio.com.invalid> wrote:

> The idea of client-chosen separator char (?delim=.) sounds pretty
> reasonable to me.
>
> Nonetheless, I do not think this covers all the issues in putting
> namespaces in URI paths for servers running under the new Servlet spec. In
> particular, there are other chars that are considered "suspicious" by the
> servlet spec and may not be allowed.
>
> Since Iceberg defines a REST API spec, I think it is reasonable for REST
> server implementers to expect Iceberg to provide means in the REST spec to
> ensure compatibility with modern servlet frameworks. This is critical so
> that REST clients and servers have the same approach to URI
> construction/interpretation.
>
> That said, if the general consensus is to keep this discussion focused
> only on the namespace element separators, I'd like to add a couple of
> points:
>
> It would be nice to have a default value for this parameter matching
> current separator to ensure compatibility with older clients.
>
> I think we should also allow a client-side configuration setting for
> end-users to control this. I think it would be ok to allow servers to
> advertise a default value for the separator. However servers should accept
> different values according to the `delim` parameter. End-users should be
> able to override server-side config with their own client-side config since
> end users ultimately know more about their namespace names.
>
> Cheers,
> Dmitri.
>
> On Wed, Aug 7, 2024 at 6:07 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Sorry a bit late to this thread.
>>
>> I would personally prefer the client side separator solution (query param
>> with `?delim=.`) a bit more than the server side (config override), just
>> given the experience of handling similar situations for Glue data catalog
>> which allows any name for database (namespace) and table, except for white
>> space characters [1].
>>
>> In the Glue-HiveMetaStore connector [2], there is a feature to use a
>> single string to reference a namespace in a non-default catalog, e.g.
>> "cat1:ns1" can mean catalog cat1 namespace ns1. This is basically a 2-level
>> namespace. After a similar discussion thread and exploring what users are
>> actually using, we realized that no separator could work for everyone, so
>> we introduced a config value [3] that users can set at client side when
>> using the connector. If there is a namespace with name "ns1:ns2" in a
>> catalog "cat1", then the user can choose a different separator like "$' to
>> write "cat1$ns1:ns2", which can allow us to correctly resolve all the name
>> references.
>>
>> We found this the most flexible, since an organization would typically
>> stick to just one or two special characters, and can always find something
>> else as a separator. Even in extreme cases, a user can choose a long string
>> like "__SEPARATOR__" as the separator.
>>
>> -Jack
>>
>> [1] https://docs.aws.amazon.com/glue/latest/webapi/API_GlueTable.html
>> [2]
>> https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
>> [3]
>> https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/blob/branch-3.4.0/aws-glue-datacatalog-client-common/src/main/java/com/amazonaws/glue/catalog/util/AWSGlueConfig.java#L26
>>
>>
>> On Tue, Aug 6, 2024 at 1:32 PM Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>>
>>>
>>>    1. Change %1F to %2E:
>>>
>>> As I noted in my earlier email, a catalog may choose to use . as a
>>> one-way conversion so that it doesn’t matter that you can’t split the
>>> namespace. This does work, but with slightly different behavior.
>>>
>>> The original decision on this issue was that the behavior should be a
>>> catalog choice, which is why we went with what we thought was a safe
>>> delimiter. For this discussion, the model I originally proposed could be
>>> supported by setting the delimiter to ..
>>>
>>> New option returned from the config-endpoint telling the which
>>> namespace-element separator to use
>>>
>>> As Eduard noted, old clients affected by this problem are already broken
>>> so it is okay if the fix is only for newer clients. If a client isn’t
>>> broken, then the service still needs to support 0x1F for compatibility.
>>> Given that 0x1F is not a common character in names, I think that should
>>> be safe.
>>>
>>> I do not see any other option than changing the REST spec and define an
>>> escaping mechanism, which requires new endpoints.
>>>
>>> I disagree here. I think making the delimiter configurable is a good
>>> solution that doesn’t require new endpoints. I definitely prefer this to a
>>> more complicated scheme.
>>>
>>> Ryan
>>>
>>> On Tue, Aug 6, 2024 at 2:06 AM Robert Stupp <sn...@snazy.de> wrote:
>>>
>>>> I'd like to summarize the proposals that came up:
>>>>
>>>> 1. Change `%1F` to `%2E`:
>>>>
>>>> Already existing namespaces that have the dot-character (`%2E` == `.`)
>>>> become inaccessible. A namespace ["my", "elem.foo"] is then encoded as
>>>> `my.elem.foo` and decoded as ["my", "elem", "foo"], which is incorrect.
>>>>
>>>> 2. New option returned from the config-endpoint telling the which
>>>> namespace-element separator to use:
>>>>
>>>> (Old) clients won't respect this option - and in turn, the service has
>>>> to expect both `%1F` and the "proposed" separator character. If an old
>>>> client sends a namespace element that contains the separator proposed by
>>>> the service, the namespace representation on the service side is not the
>>>> same as the one on the client side. Ex: A new service advertises `%2E" via
>>>> the config endpoint - the old client doesn't know about it and encodes
>>>> ["my", "elem", "foo"] as `my%1Felem%1Ffoo`, which the service either
>>>> rejects with a HTTP/4xx or interprets as ["my%1Felem%1Ffoo"] - both are
>>>> incorrect.
>>>>
>>>> 3. Clients send a new query parameter `?delim=.`
>>>>
>>>> (Old) services that don't know about this new query parameter will
>>>> interpret the namespace differently, the namespace representation on the
>>>> service side is not the same as the one on the client side. A client
>>>> encodes a namespace ["my", "elem", "foo"] as `my.elem.foo` and adds the
>>>> `?delim=.` query param. Old services interpret this as ["my.elem.foo"],
>>>> which is incorrect.
>>>>
>>>>
>>>> In any case, services _have to_ support `%1F` or compatibility w/ older
>>>> clients will be broken.
>>>>
>>>> I do not see any other option than changing the REST spec and define an
>>>> escaping mechanism, which requires new endpoints.
>>>>
>>>> All options proposed so far are potentially breaking REST spec changes.
>>>>
>>>>
>>>> On 05.08.24 19:06, Daniel Weeks wrote:
>>>>
>>>> I would agree with adding either a server side (config override) or
>>>> client side control (query param with `?delim=.`) as it will be
>>>> compatible with the current v1 endpoint.
>>>>
>>>> In the future we could introduce a v2 endpoint(s), but I would want to
>>>> wait for OpenAPI 4 because they address this by allowing multi-segment
>>>> pathing via URI templates in RFC 6570
>>>> <https://datatracker.ietf.org/doc/html/rfc6570>, which is the original
>>>> way we wanted to represent namespaces, but it wasn't supported (e.g.
>>>> .../{+namespaces}/tables/{table}).  I doubt it's really worth the
>>>> effort though, so I feel like a configurable delimiter makes the most 
>>>> sense.
>>>>
>>>> -Dan
>>>>
>>>> --
>>>> Robert Stupp
>>>> @snazy
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>>
>>

Reply via email to