Re: Special characters in Beam Schema field names

Robert Bradshaw Thu, 19 Mar 2020 18:01:25 -0700

On Wed, Mar 18, 2020 at 8:01 PM Kenneth Knowles <[email protected]> wrote:
>
> I favor allowing field names to contain any unicode character, semantically. 
> I do not think encoding is a semantic property of a field name (or even a 
> string in a particular programming language) so UTF-8 doesn't need to be part 
> of it. Inputting a field name in a particular context is separable from what 
> characters can occur in the name, and the encoding of a string when it is 
> turned into bytes is orthogonal to what characters are in the string.


+1, I meant to say Unicode, not UTF-8.
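
To illustrate the distinction, here is a minimal sketch in plain Python (not Beam code; the sample name is made up for illustration). The same field name yields different bytes under different encodings, but it is the same sequence of characters either way:

```python
# Hypothetical field name containing non-ASCII characters
# (written with escapes so the source file's own encoding does not matter).
name = "gr\u00f6\u00dfe_\u00b5m"

utf8_bytes = name.encode("utf-8")
utf16_bytes = name.encode("utf-16-le")

# The byte representations differ...
print(utf8_bytes != utf16_bytes)  # True

# ...but both decode back to the identical field name.
print(utf8_bytes.decode("utf-8") == name)       # True
print(utf16_bytes.decode("utf-16-le") == name)  # True
```

So the encoding only matters at the point where a name is serialized, not in the definition of what a name may contain.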

> SQL has a good convention to allow any character (backticks, as you 
> demonstrated), as do most unix shells / filesystems. Note again that backtick 
> and backslash conventions are how to _input_ a field name, not the characters 
> actually in the field name. Your example of "parent.child" is a good one, 
> too: the dot is not part of any field name, but just a way to input a list of 
> names to construct a path. And your later example of using backticks around 
> the dot works perfectly if you want a dot in the field name. This is a solved 
> problem IMO, and we just have to take a solution off the shelf.
>
> Since schemas are pretty closely related to SQL, how about just using these 
> particular SQL conventions? I like backticks and I also like backslashes.

Makes sense to me.
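
As a rough sketch of what those input conventions could look like (plain Python, not the actual FieldAccessDescriptor parser; `parse_field_path` is a hypothetical name, and wildcards, brackets and braces are deliberately not modeled), assuming dot separates names, backticks quote a name, and backslash escapes the next character:

```python
def parse_field_path(path: str) -> list[str]:
    """Split an input path into the field names it denotes.

    Assumed conventions (illustrative only):
      - an unquoted dot separates field names
      - backticks quote a name, so dots inside them are literal
      - a backslash makes the following character literal
    """
    names = []
    current = []
    in_quotes = False
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "\\" and i + 1 < len(path):
            # Escaped character: take the next character literally.
            current.append(path[i + 1])
            i += 2
            continue
        if ch == "`":
            # Backticks are input syntax only; they are not part of the name.
            in_quotes = not in_quotes
        elif ch == "." and not in_quotes:
            # Unquoted dot ends the current field name.
            names.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated backtick in field path: " + path)
    names.append("".join(current))
    return names


print(parse_field_path("parent.child"))      # ['parent', 'child']
print(parse_field_path("`parent.child`"))    # ['parent.child']
print(parse_field_path(r"parent\.child"))    # ['parent.child']
```

In each case the backticks and backslashes disappear during parsing; only the resulting list of names is what the schema ever sees.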

> For debuggability, we need to always print a properly unparsed identifier, 
> not just print the field name as a string. So in the example of "we use _ 
> rather than the more natural . when concatenating field names in a nested 
> select" I would prefer to just use a dot, for clarity, and when printing it 
> the position of the backticks will make it totally clear that the dot is not 
> a field separator.

If we're generating *new* field names, I'd just as soon use a convention
that generates non-special ones, just for ease of use.
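
To make the trade-off concrete, a small sketch (plain Python; `unparse_name`, `flattened_name` and the "plain identifier" pattern are assumptions for illustration, not Beam behavior). A name joined with an underscore can be printed and typed as-is, while one joined with a dot has to be quoted everywhere it appears:

```python
import re

# Assumption: a "non-special" name is a conservative ASCII identifier.
PLAIN_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def unparse_name(name: str) -> str:
    """Render a field name so it can be read back unambiguously."""
    if PLAIN_NAME.fullmatch(name):
        return name
    # Escape backslashes and backticks, then wrap the name in backticks.
    escaped = name.replace("\\", "\\\\").replace("`", "\\`")
    return "`" + escaped + "`"


def flattened_name(path: list[str], separator: str = "_") -> str:
    """Build one new field name from a nested path (hypothetical helper)."""
    return separator.join(path)


path = ["parent", "child"]
print(unparse_name(flattened_name(path, "_")))  # parent_child
print(unparse_name(flattened_name(path, ".")))  # `parent.child`
```

Both are unambiguous when printed properly; the underscore form just never needs the quoting.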

> On Wed, Mar 18, 2020 at 5:09 PM Robert Bradshaw <[email protected]> wrote:
>>
>> Given the flexibility of SQL, and the diversity of upstream systems,
>> I'd lean on the side of being maximally flexible and saying a field
>> name is a utf-8 string (including whitespace?), but special characters
>> may require quoting and/or not allow some convenience (e.g. POJO
>> creation).
>>
>> On Wed, Mar 18, 2020 at 4:48 PM Brian Hulette <[email protected]> wrote:
>> >
>> > Another thing to consider: Both Calcite [1] and ZetaSQL [2] allow (quoted) 
>> > field names to contain any character. So it's currently possible for 
>> > SqlTransform to produce schemas with field names containing dots and other 
>> > special characters, which we can't handle properly outside of the SQL 
>> > context. If we do want to have some special characters, I think we should 
>> > validate that schemas don't contain them, which would limit what you can 
>> > output with SqlTransform, for better or worse.
>> >
>> > > We impose limits on Beam field names, and have automatic ways of 
>> > > escaping or translating characters that don't match. When the Beam field 
>> > > name does not match the field name in other systems, we use field 
>> > > Options to store the "original" name so it is not lost. That way we 
>> > > don't have to rely on the field names always being textually identical.
>> >
>> > A good use of the new Options feature :)
>> > One of the problems I would like this thread to solve though is the 
>> > possibility of using schemas and rows for the Options themselves 
>> > (discussed extensively in Alex's PR [3]). If we use Options to handle 
>> > special characters, we would need options on the schema of the Options (I 
>> > think I said that right?) to solve it in that context.
>> >
>> > > I'm all for initial strict naming rules, that we can relax as we learn 
>> > > more. Additional restrictions tend to require major version changes to 
>> > > accommodate the backwards incompatibility.
>> >
>> > I think it may be too late to be strict though, since schemas came from 
>> > SQL, and both supported SQL dialects are very permissive here. At this 
>> > point it seems easier to be very permissive within Beam, and provide ways 
>> > to deal with incompatibilities at the boundaries (e.g. SDKs providing ways 
>> > to translate fields for language types, raising errors when a schema is 
>> > incompatible for some IO, etc).
>> >
>> > [1] https://calcite.apache.org/docs/reference.html#identifiers
>> > [2] https://github.com/google/zetasql/blob/master/docs/lexical.md#identifiers
>> > [3] https://github.com/apache/beam/pull/10413
>> >
>> > On Wed, Mar 18, 2020 at 4:06 PM Robert Burke <[email protected]> wrote:
>> >>
>> >> I'm all for initial strict naming rules, that we can relax as we learn 
>> >> more. Additional restrictions tend to require major version changes to 
>> >> accommodate the backwards incompatibility.
>> >>
>> >> I'd rather the community provide compelling use cases for relaxations 
>> >> than have us speculate at the outset about what could be useful.
>> >>
>> >> That said, it might be a touch late for schema fields...
>> >>
>> >> It's definitely my Go Bias showing but a sensible start is to not allow 
>> >> fields to start with a digit. This matches most C derived languages 
>> >> (which includes all our SDK languages at present, except maybe for 
>> >> Scio...).
>> >>
>> >>
>> >>
>> >> On Wed, Mar 18, 2020, 2:59 PM Reuven Lax <[email protected]> wrote:
>> >>>
>> >>> For completeness, here's another proposal.
>> >>>
>> >>> We impose limits on Beam field names, and have automatic ways of 
>> >>> escaping or translating characters that don't match. When the Beam field 
>> >>> name does not match the field name in other systems, we use field 
>> >>> Options to store the "original" name so it is not lost. That way we 
>> >>> don't have to rely on the field names always being textually identical.
>> >>>
>> >>> Downside here: any time we automatically munge a field name, we make 
>> >>> select statements a bit more awkward, as the user has to put the munged 
>> >>> field name into the select.
>> >>>
>> >>> Reuven
>> >>>
>> >>> On Wed, Mar 18, 2020 at 12:22 PM Brian Hulette <[email protected]> 
>> >>> wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, Mar 18, 2020 at 12:12 PM Reuven Lax <[email protected]> wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Mar 18, 2020 at 12:09 PM Brian Hulette <[email protected]> 
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> In Beam schemas we don't seem to have a well-defined policy around 
>> >>>>>> special characters (like $.[]) in field names. There's never any 
>> >>>>>> explicit validation, but we do have some ad-hoc rules (e.g. we use _ 
>> >>>>>> rather than the more natural . when concatenating field names in a 
>> >>>>>> nested select [1])
>> >>>>>>
>> >>>>>> I think we should explicitly allow any special character (any valid 
>> >>>>>> UTF-8 character?) to be used in Beam schema field names. But in order 
>> >>>>>> to do this we will need to provide solutions for some edge cases. To 
>> >>>>>> my knowledge there are two problems that arise with some special 
>> >>>>>> characters in field names:
>> >>>>>>
>> >>>>>> 1. They can't be mapped to language types (e.g. Java Classes, and 
>> >>>>>> NamedTuples in python).
>> >>>>>
>> >>>>>
>> >>>>> We already have this problem - i.e. if you name a schema field to be 
>> >>>>> int, or any other reserved string. We should disambiguate.
>> >>>>
>> >>>> True, but as I point out below we have ways to deal with this problem. 
>> >>>> (2) is really the problem we need to solve.
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>> 2. It can make field accesses ambiguous (i.e. does 
>> >>>>>> `FieldAccessDescriptor.withFieldNames("parent.child")` reference a 
>> >>>>>> field with that exact name or a nested field?).
>> >>>>>
>> >>>>>
>> >>>>> I still think that we should reserve _some_ special characters. I'm 
>> >>>>> not sure what the use is for allowing any character to be used.
>> >>>>
>> >>>> The use would be ensuring that we don't run into compatibility issues 
>> >>>> when mapping schemas from other systems that have made different 
>> >>>> choices about which characters are special.
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>> We already have some precedent for (1) - Beam SQL produces field 
>> >>>>>> names like `$col1` for unaliased fields in query outputs, and this is 
>> >>>>>> allowed. If a user wants to map a schema with a field like this to a 
>> >>>>>> POJO, they have to first rename the incompatible field(s), or use an 
>> >>>>>> @SchemaFieldName annotation to map the field name. I think these are 
>> >>>>>> reasonable solutions.
>> >>>>>>
>> >>>>>> We do not have a solution for (2) though. I think we should allow the 
>> >>>>>> use of a backslash to escape characters that otherwise have special 
>> >>>>>> meaning for FieldAccessDescriptors (based on [2] this is .[]{}*).
>> >>>>
>> >>>> I think the SQL way of handling this is to require a field name to be 
>> >>>> wrapped in some way when it contains special characters, e.g. 
>> >>>> "`some.parent.field`.`some.child.field`". We could consider that as 
>> >>>> well.
>> >>>>>>
>> >>>>>>
>> >>>>>> Does anyone have any objection to this proposal, or is there anything 
>> >>>>>> I'm overlooking? If not, I'm happy to take the task to implement the 
>> >>>>>> escape character change.
>> >>>>>>
>> >>>>>> Brian
>> >>>>>>
>> >>>>>> [1] https://github.com/apache/beam/blob/8abc90b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Select.java#L186-L189
>> >>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/antlr/org/apache/beam/sdk/schemas/parser/generated/FieldSpecifierNotation.g4
