On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <[email protected]> wrote:

>
>
> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <[email protected]>
> wrote:
>
>> In Beam schemas we don't seem to have a well-defined policy around
>> special characters (like $.[]) in field names. There's never any explicit
>> validation, but we do have some ad-hoc rules (e.g. we use _ rather than the
>> more natural . when concatenating field names in a nested select [1])
>>
>> I think we should explicitly allow any special character (any valid UTF-8
>> character?) to be used in Beam schema field names. But in order to do this
>> we will need to provide solutions for some edge cases. To my knowledge
>> there are two problems that arise with some special characters in field
>> names:
>>
>> 1. They can't be mapped to language types (e.g. Java Classes, and
>> NamedTuples in python).
>>
>
> We already have this problem - i.e. if you name a schema field to be int,
> or any other reserved string. We should disambiguate.
>
True, but as I point out below we have ways to deal with this problem. (2)
is really the problem we need to solve.

>
>
>> 2. It can make field accesses ambiguous (i.e. does
>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a field
>> with that exact name or a nested field?).
>>
>
> I still think that we should reserve _some_ special characters. I'm not
> sure what the use is for allowing any character to be used.
>
The use would be ensuring that we don't run into compatibility issues when
mapping schemas from other systems that have made different choices about
which characters are special.

>
>
>> We already have some precedent for (1) - Beam SQL produces field names
>> like `$col1` for unaliased fields in query outputs, and this is allowed. If
>> a user wants to map a schema with a field like this to a POJO, they have to
>> first rename the incompatible field(s), or use an @SchemaFieldName
>> annotation to map the field name. I think these are reasonable solutions.
>>
>> We do not have a solution for (2) though. I think we should allow the use
>> of a backslash to escape characters that otherwise have special meaning for
>> FieldAccessDescriptors (based on [2] this is .[]{}*).
>>
I think the SQL way of handling this is to require a field name to be
wrapped in some way when it contains special characters, e.g.
"`some.parent.field`.`some.child.field`". We could consider that as well.

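To make the two options concrete, here is a rough Python sketch of how a
field path could be split under either rule. It is only an illustration:
split_field_path is a hypothetical helper, not anything in Beam, and it
assumes that a backslash makes the next character literal and that a
backtick-quoted run is taken verbatim (no escapes inside the quotes). It
also ignores the other special characters ([]{}*) entirely.

```python
from typing import List


def split_field_path(path: str) -> List[str]:
    """Split a dotted field path into individual field names.

    Hypothetical helper, not part of Beam. Outside backticks, a backslash
    makes the following character literal. Inside backticks, every
    character (including '.' and '\\') is taken literally.
    """
    parts: List[str] = []
    current: List[str] = []
    in_quotes = False
    i = 0
    while i < len(path):
        ch = path[i]
        if in_quotes:
            if ch == "`":
                in_quotes = False  # closing backtick ends the quoted run
            else:
                current.append(ch)
        elif ch == "\\":
            if i + 1 >= len(path):
                raise ValueError("dangling backslash at end of path")
            current.append(path[i + 1])  # escaped character is literal
            i += 1
        elif ch == "`":
            in_quotes = True
        elif ch == ".":
            # An unescaped, unquoted dot separates nesting levels.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated backtick quote")
    parts.append("".join(current))
    return parts
```

With this, "parent.child" splits into two nested names, while
r"parent\.child" and "`parent.child`" both yield the single field name
"parent.child".
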
>
>> Does anyone have any objection to this proposal, or is there anything I'm
>> overlooking? If not, I'm happy to take the task to implement the escape
>> character change.
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
>>
>
