Just curious—why did we originally introduce %1F as a separator?

When we were first discussing how to send multi-part namespace identifiers,
there was a choice about how to encode them. I advocated for using . as we
do for column names, but there were people who felt strongly about needing
to use a different character (at first %00 and later %1F) so that the name
could be split back into the original list of identifiers.

The trade-off when using . is that the conversion is one-way: you can
convert ["a", "b.c"] into "a.b.c" but you can’t reliably split (which would
produce ["a", "b", "c"]). Using a one-way conversion is simpler because it
avoids the need for a reliable separator character (and that’s been a
problem) and avoids the alternative of more complicated escaping and
quoting rules.
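
To make the one-way behavior concrete, here is a minimal Python sketch. The helper names flatten and split_flat are made up for illustration and are not Iceberg APIs; the only assumption is that a separator such as 0x1F never appears inside a namespace level.

```python
# Hypothetical helpers for illustration only; not part of Iceberg.
UNIT_SEPARATOR = "\x1f"  # the character that is sent as %1F


def flatten(levels, sep):
    """Join namespace levels into a single string."""
    return sep.join(levels)


def split_flat(name, sep):
    """Split a flattened name back into levels."""
    return name.split(sep)


original = ["a", "b.c"]

# With ".", the round trip loses the original level boundaries.
dotted = flatten(original, ".")
print(dotted)                   # a.b.c
print(split_flat(dotted, "."))  # ['a', 'b', 'c']

# With 0x1F, the round trip is exact as long as no level contains 0x1F.
encoded = flatten(original, UNIT_SEPARATOR)
print(split_flat(encoded, UNIT_SEPARATOR) == original)  # True
```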

The primary drawback to a one-way conversion is that you can have at most
one object with the flattened name. If the service receives "a.b.c" then
there can be only ["a", "b", "c"] OR ["a.b", "c"] OR ["a", "b.c"] OR
["a.b.c"] because all of those names are identical when flattened.
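
As a rough illustration of that collision, the sketch below enumerates every multi-part name that flattens to the same dotted string. The function name candidates is hypothetical and exists only for this example.

```python
from itertools import combinations


def candidates(flat, sep="."):
    """List every multi-part name that flattens to `flat` when joined with `sep`."""
    parts = flat.split(sep)
    gaps = range(1, len(parts))  # positions between adjacent parts
    result = []
    # Choose which gaps are real level boundaries; the remaining gaps
    # are separator characters that belong inside a level.
    for k in range(len(parts)):
        for cuts in combinations(gaps, k):
            bounds = [0, *cuts, len(parts)]
            result.append(
                [sep.join(parts[i:j]) for i, j in zip(bounds, bounds[1:])]
            )
    return result


for levels in candidates("a.b.c"):
    print(levels)
```

For "a.b.c" this yields the four names listed above. In general a flattened name containing n separators has 2^n possible originals, and a catalog using one-way flattening can hold at most one of them.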

At the time we originally discussed this, there was support for being able
to distinguish between names that flatten to the same representation. I
think that’s fine, but I really like the current proposal to let the
separator character be configured as a long-term fix. That way catalogs can
choose their own behavior.

A follow up question: can we restrict the character “.” in a namespace
name?

I don’t think there’s a need to do this if catalogs can choose their own
separator character. Hive catalogs could choose not to support ., in which
case using it as a separator doesn’t have any issues. Another catalog could
choose not to allow more than one namespace that flattens to the same name.
And others could choose a completely different separator character to be
able to reconstruct the original multi-part identifier.
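
A sketch of how those per-catalog choices might look, assuming the separator arrives as a percent-encoded config property. The key "namespace.separator" is the name floated earlier in this thread, not an existing setting, and the function names are hypothetical.

```python
from urllib.parse import unquote

DEFAULT_SEPARATOR = "\x1f"  # current behavior, sent as %1F


def resolve_separator(config):
    """Use the catalog's separator if it sent one, else fall back to 0x1F."""
    # "namespace.separator" is a proposed, hypothetical property name.
    raw = config.get("namespace.separator")
    return unquote(raw) if raw else DEFAULT_SEPARATOR


def flatten_strict(levels, sep):
    """Join levels, rejecting any level that contains the separator.

    This models a catalog that disallows the separator inside names,
    so that splitting always recovers the original levels.
    """
    for level in levels:
        if sep in level:
            raise ValueError(
                f"namespace level {level!r} contains separator {sep!r}"
            )
    return sep.join(levels)


# A catalog that sends nothing keeps the current 0x1F behavior.
print(repr(resolve_separator({})))
# A catalog that disallows "." in names can safely use it as the separator.
sep = resolve_separator({"namespace.separator": "%2e"})
print(flatten_strict(["a", "b"], sep))
```

A catalog that prefers one-way flattening would skip the check in flatten_strict and instead refuse to create a second namespace with the same flattened name.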

On Fri, Aug 2, 2024 at 10:10 AM Yufei Gu <flyrain...@gmail.com> wrote:

> Just curious—why did we originally introduce %1F as a separator? Was it
> because we wanted to allow "." as a valid character in namespaces? If
> that’s the case, I get that we couldn't use "." or  "%2e" as a separator.
>
> A follow up question: can we restrict the character "." in a namespace
> name? For example, HIVE doesn't support "." in name or database names.
>
> Yufei
>
>
> On Fri, Aug 2, 2024 at 9:44 AM Robert Stupp <sn...@snazy.de> wrote:
>
>> I'd be very careful here.
>>
>> The strings in `Namespace` elements are unconstrained. Neither the
>> `Namespace` implementation in Iceberg/Java nor the REST spec restrict the
>> contents of the namespace elements. So a '.' can appear in existing
>> namespace elements and choosing %2E breaks such existing namespaces.
>>
>> Changing %1F to some random other char >= 0x20 has the potential to break
>> existing namespaces.
>>
>> What's needed IMHO is likely an escaping mechanism - not a single char.
>>
>> On 02.08.24 01:42, Yufei Gu wrote:
>>
>> +1 on the first option. We may not overly use the config endpoint, but
>> it'd be suitable in this case. We can introduce a new field like this:
>>
>> namespace.separator=%2e
>>
>> Yufei
>>
>>
>> On Thu, Aug 1, 2024 at 3:46 PM Ryan Blue <b...@databricks.com.invalid>
>> <b...@databricks.com.invalid> wrote:
>>
>>> I think the simplest way to preserve compatibility is to allow this to
>>> be configured on the client and by the config route, and fall back to the
>>> current value, 0x1f. Another option is to introduce a set of v2 endpoints
>>> that use a different separator character. I prefer the first option since
>>> the only way to work with a service that can't support 0x1f is to replace
>>> the separator character. Older clients are already broken, so if they don't
>>> support the property sent by the config route there is no behavior change.
>>>
>>> Ryan
>>>
>>> On Thu, Aug 1, 2024 at 9:47 AM Robert Stupp <sn...@snazy.de> wrote:
>>>
>>>> How is compatibility with older servers guaranteed?
>>>> On 01.08.24 14:59, Eduard Tudenhöfner wrote:
>>>>
>>>> Hey everyone,
>>>>
>>>> The REST spec
>>>> <https://github.com/apache/iceberg/blob/6319712b612b724fedbc5bed41942ac3426ffe48/open-api/rest-catalog-open-api.yaml#L225>
>>>> currently uses *%1F* as the UTF-8 encoded namespace separator for
>>>> multi-part namespaces.
>>>> This causes issues <https://github.com/apache/iceberg/issues/10338>,
>>>> since it's a control character
>>>> <https://www.compart.com/en/unicode/category/Cc> and the Servlet spec
>>>> <https://jakarta.ee/specifications/servlet/6.0/jakarta-servlet-spec-6.0.html#uri-path-canonicalization>
>>>>  can
>>>> reject such characters.
>>>>
>>>> I'm proposing to replace *%1F* with a different character that isn't
>>>> problematic (such as *%2E*) and also add some backwards compatible
>>>> namespace decoding logic to *RESTUtil* so that older clients sending
>>>> *%1F* can still do so.
>>>>
>>>> PS: I also investigated why *%1F* doesn't fail in *TestRESTCatalog* and
>>>> it's because we're using  Jetty 9.x and the javax.servlet API 4.0 (instead
>>>> of 6.x). I'll open a separate PR to upgrade Jetty and use jakarta.servlet
>>>> API 6.x, which will reproduce the issue with *%1F* being used as the
>>>> namespace separator.
>>>>
>>>> Eduard
>>>>
>>>>
>>>>
>>>> --
>>>> Robert Stupp
>>>> @snazy
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>>
>> --
>> Robert Stupp
>> @snazy
>>
>>

-- 
Ryan Blue
Databricks
