Re: [DISCUSS] Portability representation of schemas

Robert Bradshaw Wed, 08 May 2019 18:40:09 -0700

From: Reuven Lax <[email protected]>
Date: Wed, May 8, 2019 at 10:36 PM
To: dev


> On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw <[email protected]> wrote:
>>
>> Very excited to see this. In particular, I think this will be very
>> useful for cross-language pipelines (not just SQL, but also for
>> describing non-trivial data, e.g. for source and sink reuse).
>>
>> The proto specification makes sense to me. The only thing that looks
>> like it's missing (other than possibly iterable, for arbitrarily-large
>> support) is multimap. Another basic type, should we want to support
>> it, is union (though this of course can get messy).
>
> multimap is an interesting suggestion. Do you have a use case in mind?
>
> union (or oneof) is also a good suggestion. There are good use cases for 
> this, but this is a more fundamental change.

No specific use case; they just seemed to round out the options.

>> I'm curious what the rationale was for going with a oneof for type_info
>> rather than repeated components like we do with coders.
>
> No strong reason. Do you think repeated components is better than oneof?

It's more consistent with how we currently do coders (which has pros and cons).

>> Removing DATETIME as a logical coder on top of INT64 may cause issues
>> of insufficient resolution and/or timespan. Similarly with DECIMAL (or
>> would it be backed by string?)
>
> There could be multiple TIMESTAMP types for different resolutions, and they 
> don't all need the same backing field type. E.g. the backing type for 
> nanoseconds could be Row(INT64, INT64), or it could just be a byte array.

Hmm.... What would the value be in supporting different types of
timestamps? Would all SDKs have to support all of them? Can one
compare, take differences, etc. across timestamp types? (As Luke
points out, the other conversation on timestamps is likely relevant
here as well.)
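
To make the resolution/timespan trade-off concrete, here is a minimal
Python sketch. It assumes (this is not from the proposal) that the
Row(INT64, INT64) backing would mean a (seconds, nanos) pair with nanos
normalized to [0, 10^9); the names NanosTimestamp, from_total_nanos, and
int64_nanos_span_years are hypothetical and exist only for illustration.

```python
from typing import NamedTuple

INT64_MAX = 2**63 - 1
NANOS_PER_SECOND = 10**9


class NanosTimestamp(NamedTuple):
    """Hypothetical (seconds, nanos) pair; an assumed reading of Row(INT64, INT64)."""
    seconds: int
    nanos: int  # always normalized to 0 <= nanos < NANOS_PER_SECOND


def from_total_nanos(total: int) -> NanosTimestamp:
    # divmod floors toward negative infinity, so nanos stays non-negative
    # even for instants before the epoch.
    seconds, nanos = divmod(total, NANOS_PER_SECOND)
    return NanosTimestamp(seconds, nanos)


def to_total_nanos(ts: NanosTimestamp) -> int:
    return ts.seconds * NANOS_PER_SECOND + ts.nanos


def int64_nanos_span_years() -> float:
    # How far from the epoch a single INT64 of nanoseconds can reach.
    max_seconds = INT64_MAX // NANOS_PER_SECOND
    return max_seconds / (365.25 * 86400)


if __name__ == "__main__":
    print(from_total_nanos(1_500_000_000))
    print(round(int64_nanos_span_years(), 1), "years on either side of the epoch")
```

Because nanos is normalized, plain tuple comparison on the pair agrees
with chronological order, which is one way comparison could be defined
within a single timestamp type; comparing across types would still need
an explicit conversion.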

>> The biggest question, as far as portability is concerned at least, is
>> the notion of logical types. serialized_class is clearly not portable,
>> and I also think we'll want a way to share semantic meaning across
>> SDKs (especially if things like dates become logical types). Perhaps
>> URNs (+payloads) would be a better fit here?
>
> Yes, URN + payload is probably the better fit for portability.
>
>> Taking a step back, I think it's worth asking why we have different
>> types, rather than simply making everything a LogicalType of bytes
>> (aka coder). Other than encoding format, the answer I can come up with
>> is that the type decides the kinds of operations that can be done on
>> it, e.g. does it support comparison? Arithmetic? Containment?
>> Higher-level date operations? Perhaps this should be used to guide the
>> set of types we provide.
>
> Also even though we could make everything a LogicalType (though at least byte 
> array would have to stay primitive), I think it's useful to have a slightly 
> larger set of primitive types. It makes things easier to understand and 
> debug, and it makes it simpler for the various SDKs to map them to their 
> types (e.g. mapping to POJOs).

This would be the case if one didn't have LogicalType at all, but
once one introduces that one now has this more complicated two-level
hierarchy of types which doesn't seem simpler to me.

I'm trying to understand what information Schema encodes that a
NamedTupleCoder (or RowCoder) would/could not. (Coders have the
disadvantage that there are multiple encodings of a single value, e.g.
BigEndian vs. VarInt, but if we have multiple resolutions of timestamp
that would still seem to be an issue. Possibly another advantage is
encoding into non-record-oriented formats, e.g. Parquet or Arrow, that
have a set of primitives.)
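
As an illustration of the "multiple encodings of a single value" point,
the following Python sketch encodes the same integer two ways. It is a
stand-alone approximation, not Beam's coder code: encode_big_endian and
encode_varint are hypothetical helper names, and the varint here is the
common base-128 scheme applied to the 64-bit two's-complement bits.

```python
import struct

MASK_64 = 0xFFFFFFFFFFFFFFFF


def encode_big_endian(value: int) -> bytes:
    # Fixed width: always 8 bytes, most significant byte first.
    return struct.pack(">q", value)


def encode_varint(value: int) -> bytes:
    # Base-128 varint over the 64-bit two's-complement bit pattern:
    # 7 payload bits per byte, high bit set on every byte except the last.
    bits = value & MASK_64
    out = bytearray()
    while True:
        byte = bits & 0x7F
        bits >>= 7
        if bits:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


if __name__ == "__main__":
    for v in (1, 300, -1):
        print(v, encode_big_endian(v).hex(), encode_varint(v).hex())
```

The logical type is the same 64-bit integer in both cases; only the
bytes on the wire differ, which is the information a coder carries and
a schema field type by itself does not.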
