Re: Special characters in Beam Schema field names

Brian Hulette Wed, 18 Mar 2020 16:49:00 -0700

Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow (quoted)
field names to contain any character. So it's currently possible for
SqlTransform to produce schemas with field names containing dots and other
special characters, which we can't handle properly outside of the SQL
context. If we do want to reserve some special characters, I think we should
validate that schemas don't contain them, which would limit what you can
output with SqlTransform, for better or worse.
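
To make the validation option concrete, here is a minimal sketch. It assumes
the reserved set is the .[]{}* characters that FieldAccessDescriptors treat
specially (per the grammar linked later in this thread). FieldNameValidator
and invalidNames are hypothetical names for illustration, not Beam API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Beam API.
final class FieldNameValidator {
  // Assumption: reserve the characters that have special meaning in
  // FieldAccessDescriptor strings.
  static final String RESERVED = ".[]{}*";

  /** Returns the field names that contain at least one reserved character. */
  static List<String> invalidNames(List<String> fieldNames) {
    List<String> invalid = new ArrayList<>();
    for (String name : fieldNames) {
      for (int i = 0; i < name.length(); i++) {
        if (RESERVED.indexOf(name.charAt(i)) >= 0) {
          invalid.add(name);
          break; // one reserved character is enough to reject the name
        }
      }
    }
    return invalid;
  }
}
```

With this reserved set, a SQL-generated name like $col1 would still pass,
while a quoted SQL identifier such as "parent.child" would be rejected.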


> We impose limits on Beam field names, and have automatic ways of escaping
> or translating characters that don't match. When the Beam field name does
> not match the field name in other systems, we use field Options to store
> the "original" name so it is not lost. That way we don't have to rely on
> the field names always being textually identical.

A good use of the new Options feature :)
One of the problems I would like this thread to solve though is the
possibility of using schemas and rows for the Options themselves (discussed
extensively in Alex's PR [3]). If we use Options to handle special
characters, we would need options on the schema of the Options (I think I
said that right?) to solve it in that context.

> I'm all for initial strict naming rules, that we can relax as we learn
> more. Additional restrictions tend to require major version changes to
> accommodate the backwards incompatibility.

I think it may be too late to be strict though, since schemas came from
SQL, and both supported SQL dialects are very permissive here. At this
point it seems easier to be very permissive within Beam, and provide ways
to deal with incompatibilities at the boundaries (e.g. SDKs providing ways
to translate fields for language types, raising errors when a schema is
incompatible for some IO, etc).
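
As a rough illustration of handling this at a boundary, the sketch below
translates arbitrary field names into names usable as Java identifiers and
keeps the original name alongside each translated one. BoundaryNames,
toIdentifier and translate are hypothetical names, not Beam API, and the
replace-with-underscore rule is an assumption. It works on UTF-16 chars and
does not handle reserved words such as int.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Beam API.
final class BoundaryNames {
  /** Replaces characters that cannot appear in a Java identifier with '_'. */
  static String toIdentifier(String fieldName) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < fieldName.length(); i++) {
      char c = fieldName.charAt(i);
      out.append(Character.isJavaIdentifierPart(c) ? c : '_');
    }
    // Identifiers may not be empty or start with a digit, so prefix '_'.
    if (out.length() == 0 || !Character.isJavaIdentifierStart(out.charAt(0))) {
      out.insert(0, '_');
    }
    return out.toString();
  }

  /** Maps each translated name to its original; fails on a collision. */
  static Map<String, String> translate(List<String> fieldNames) {
    Map<String, String> translatedToOriginal = new LinkedHashMap<>();
    for (String name : fieldNames) {
      String translated = toIdentifier(name);
      String previous = translatedToOriginal.put(translated, name);
      if (previous != null) {
        throw new IllegalArgumentException(
            "Fields '" + previous + "' and '" + name
                + "' both translate to '" + translated + "'");
      }
    }
    return translatedToOriginal;
  }
}
```

Raising an error on a collision (e.g. a.b and a_b in the same schema) is one
way to surface an incompatibility instead of silently merging fields.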

[1] https://calcite.apache.org/docs/reference.html#identifiers
[2]
https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
[3] https://github.com/apache/beam/pull/10413

On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <[email protected]> wrote:

> I'm all for initial strict naming rules, that we can relax as we learn
> more. Additional restrictions tend to require major version changes to
> accommodate the backwards incompatibility.
>
> I'd rather the community provide compelling use cases for relaxations than
> us speculating what could be useful at the outset.
>
> That said, it might be a touch late for schema fields...
>
> It's definitely my Go Bias showing but a sensible start is to not allow
> fields to start with a digit. This matches most C derived languages (which
> includes all our SDK languages at present, except maybe for Scio...).
>
>
>
> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <[email protected]> wrote:
>
>> For completeness, here's another proposal.
>>
>> We impose limits on Beam field names, and have automatic ways of escaping
>> or translating characters that don't match. When the Beam field name does
>> not match the field name in other systems, we use field Options to store
>> the "original" name so it is not lost. That way we don't have to rely on
>> the field names always being textually identical.
>>
>> Downside here: any time we automatically munge a field name, we make
>> select statements a bit more awkward, as the user has to put the munged
>> field name into the select.
>>
>> Reuven
>>
>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <[email protected]>
>>>> wrote:
>>>>
>>>>> In Beam schemas we don't seem to have a well-defined policy around
>>>>> special characters (like $.[]) in field names. There's never any explicit
>>>>> validation, but we do have some ad-hoc rules (e.g. we use _ rather than
>>>>> the more natural . when concatenating field names in a nested select [1])
>>>>>
>>>>> I think we should explicitly allow any special character (any valid
>>>>> UTF-8 character?) to be used in Beam schema field names. But in order to
>>>>> do this we will need to provide solutions for some edge cases. To my
>>>>> knowledge there are two problems that arise with some special characters
>>>>> in field names:
>>>>>
>>>>> 1. They can't be mapped to language types (e.g. Java Classes, and
>>>>> NamedTuples in Python).
>>>>>
>>>>
>>>> We already have this problem - i.e. if you name a schema field to be
>>>> int, or any other reserved string. We should disambiguate.
>>>>
>>> True, but as I point out below we have ways to deal with this problem.
>>> (2) is really the problem we need to solve.
>>>
>>>>
>>>>
>>>>> 2. It can make field accesses ambiguous (i.e. does
>>>>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>>>>> with that exact name or a nested field?).
>>>>>
>>>>
>>>> I still think that we should reserve _some_ special characters. I'm not
>>>> sure what the use is for allowing any character to be used.
>>>>
>>> The use would be ensuring that we don't run into compatibility issues
>>> when mapping schemas from other systems that have made different choices
>>> about which characters are special.
>>>
>>>>
>>>>
>>>>> We already have some precedent for (1) - Beam SQL produces field names
>>>>> like `$col1` for unaliased fields in query outputs, and this is allowed.
>>>>> If a user wants to map a schema with a field like this to a POJO, they
>>>>> have to first rename the incompatible field(s), or use an
>>>>> @SchemaFieldName annotation to map the field name. I think these are
>>>>> reasonable solutions.
>>>>>
>>>>> We do not have a solution for (2) though. I think we should allow the
>>>>> use of a backslash to escape characters that otherwise have special
>>>>> meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>>>>>
>>> I think the SQL way of handling this is to require a field name to be
>>> wrapped in some way when it contains special characters, e.g.
>>> "`some.parent.field`.`some.child.field`". We could consider that as well.
>>>
>>>>
>>>>> Does anyone have any objection to this proposal, or is there anything
>>>>> I'm overlooking? If not, I'm happy to take the task to implement the
>>>>> escape character change.
>>>>>
>>>>> Brian
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>>>>> [2]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>>>>
>>>>
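
For reference, the backslash-escape idea from the original proposal quoted
above could look roughly like the sketch below. It only covers splitting a
field path on unescaped dots; the other special characters ([]{}*) and the
real FieldAccessDescriptor parser are out of scope. EscapedFieldPath and
split are hypothetical names, not Beam API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Beam API.
final class EscapedFieldPath {
  /**
   * Splits a field path on unescaped dots. A backslash makes the next
   * character literal, so an escaped dot stays inside the field name.
   */
  static List<String> split(String path) {
    List<String> parts = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < path.length(); i++) {
      char c = path.charAt(i);
      if (c == '\\') {
        if (i + 1 >= path.length()) {
          throw new IllegalArgumentException("Dangling escape in: " + path);
        }
        current.append(path.charAt(++i)); // take the escaped character as-is
      } else if (c == '.') {
        parts.add(current.toString()); // unescaped dot ends this field name
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    parts.add(current.toString());
    return parts;
  }
}
```

Under this scheme the path parent.child names a nested field (two parts),
while parent\.child names a single field whose name contains a dot.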
