True - however at some level that's up to the user. We should be diligent
about not implementing core functionality this way (so far schema metadata
has only been used for the fidelity use case above). However, if some users
want to use it more extensively in their pipelines, that's up to them.
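To make the fidelity use case concrete, here is a minimal illustrative sketch (not the actual Beam API - all names here are hypothetical) of field-level options that survive a projection, the way proto field metadata would need to carry through intermediate schemas:

```python
# Illustrative mini-model (hypothetical names, NOT the Beam Schema API):
# field-level options that survive a projection, as in the proto
# round-trip fidelity use case described above.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SchemaField:
    name: str
    typ: str
    # Typed key/value metadata attached to the field, e.g. proto options.
    options: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Schema:
    fields: tuple

    def project(self, *names):
        # Projection keeps each retained field intact, options included,
        # so the metadata carries through to the intermediate schema.
        keep = set(names)
        return Schema(tuple(f for f in self.fields if f.name in keep))


schema = Schema((
    SchemaField("user_id", "INT64", {"proto:field_number": 1}),
    SchemaField("email", "STRING", {"proto:field_number": 2, "pii": True}),
    SchemaField("debug", "STRING"),
))

# Project out two fields; their options ride along unchanged, so writing
# protos back out can restore the original field metadata.
projected = schema.project("user_id", "email")
```

The point of the sketch is only that the options live on the field itself, so any schema-preserving operation keeps them for free.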

Reuven

On Fri, Feb 7, 2020 at 2:02 PM Kenneth Knowles <[email protected]> wrote:

> It is a good point that it applies to configuring sources and sinks
> mostly, or external data more generally.
>
> What I worry about is that metadata channels like this tend to take over
> everything, and do it worse than a more structured approach would.
>
> As an exaggerated example which is not actually that far-fetched, schemas
> could have a single type: bytes. And then the options on the fields would
> say how to encode/decode the bytes. Now you've created a system where the
> options are the main event and are actually core to the system. My
> expectation is that this will almost certainly happen unless we are
> vigilant about preventing core features from landing as options.
>
> Kenn
>
> On Fri, Feb 7, 2020 at 1:56 PM Reuven Lax <[email protected]> wrote:
>
>> I disagree - I've had several cases where user options on fields are very
>> useful internally.
>>
>> A common rationale is to preserve fidelity. For instance, reading protos,
>> projecting out a few fields, writing protos back out. You want to be able
>> to nicely map protos to Beam schemas, but also preserve all the extra
>> metadata on proto fields. This metadata has to carry through on all
>> intermediate PCollections and schemas, so it makes sense to put it on the
>> field.
>>
>> Another example: options provide a nice extension point for annotations
>> on a field. These are often things like Optional or Taint markings that
>> don't change the semantic interpretation of the type (so they shouldn't
>> be a logical type), but do provide extra information about the field.
>>
>> Reuven
>>
>> On Fri, Feb 7, 2020 at 8:54 AM Brian Hulette <[email protected]> wrote:
>>
>>> I'm not sure this belongs directly on schemas. I've had trouble
>>> reconciling that opinion, since the idea does seem very useful, and in fact
>>> I'm interested in using it myself. I think I've figured out my concern -
>>> what I really want is options for a (maybe portable) Table.
>>>
>>> As I indicated in a comment in the doc [1] I still think all of the
>>> examples you've described only apply to IOs. To be clear, what I mean is
>>> that all of the examples either
>>> 1) modify the behavior of the external system the IO is interfacing with
>>> (specify partitioning, indexing, etc..), or
>>> 2) define some transformation that should be done to the data adjacent
>>> to the IO (after an Input or before an Output) in Beam
>>>
>>> (1) Is the sort of thing you described in the IO section [2] (aside from
>>> the PubSub example I added, since that's describing a transformation to do
>>> in Beam)
>>> I would argue that all of the other examples fall under (2) - data
>>> validation, adding computed columns, encryption, etc... are things that can
>>> be done in a transform
>>>
>>> I think we can make an analogy to a traditional database here:
>>> schema-aware Beam IOs are like Tables in a database, other PCollections are
>>> like intermediate results in a query. In a database, Tables can be defined
>>> with some DDL and have schema-level or column-level options that change
>>> system behavior, but intermediate results have no such capability.
>>>
>>>
>>> Another point I think is worth discussing: is there value in making
>>> these options portable?
>>> As it's currently defined, I'm not sure there is - everything could be
>>> done within a single SDK. However, portable options on a portable table
>>> could be very powerful, since they could be used to configure
>>> cross-language IOs, perhaps with something like
>>> https://s.apache.org/xlang-table-provider/
>>>
>>> [1]
>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?disco=AAAAI54si4k
>>> [2]
>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit#heading=h.8sjt9ax55hmt
>>>
>>> On Wed, Feb 5, 2020, 4:17 AM Alex Van Boxel <[email protected]> wrote:
>>>
>>>> I would appreciate if someone would look at the following PR and get it
>>>> to master:
>>>>
>>>> https://github.com/apache/beam/pull/10413
>>>>
>>>> A lot of work still needs to follow, but once the base is on master,
>>>> the next layers can be built on top of it. As a reminder, this is the
>>>> base proposal:
>>>>
>>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>>>
>>>> I've also looked for prior work, and saw that Spark actually has
>>>> something comparable:
>>>>
>>>> https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Row.html
>>>>
>>>> but when the options are finished, this will be far more powerful, as
>>>> it is not limited to fields.
>>>>
>>>>  _/
>>>> _/ Alex Van Boxel
>>>>
>>>>
>>>> On Wed, Jan 29, 2020 at 4:55 AM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>>
>>>>> Using schema types for the metadata values is a nice touch.
>>>>>
>>>>> Are the options expected to be common across many fields? Perhaps the
>>>>> name should be a URN, to make it clear that collisions need to be
>>>>> avoided? (just a synonym for "name" in practice, but with a different
>>>>> connotation)
>>>>>
>>>>> I generally like this... but the examples (all but one) are weird
>>>>> things where I don't really understand how they are done or who is
>>>>> responsible for them.
>>>>>
>>>>> One way to go is this: if options are maybe not understood by all
>>>>> consumers, then they can't really change behavior. Kind of like how URN 
>>>>> and
>>>>> payload on a composite transform can be ignored and just the expansion 
>>>>> used.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Sun, Jan 26, 2020 at 8:27 AM Alex Van Boxel <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I'm proud to announce my first real proposal. The proposal describes
>>>>>> Beam Schema Options. This is an extension to the Schema API to add
>>>>>> typed metadata to Rows, Fields and Logical Types:
>>>>>>
>>>>>>
>>>>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>>>>>
>>>>>> To give you some context on where this proposal comes from: we've
>>>>>> been using dynamic, metadata-driven pipelines for a while, but till
>>>>>> now in an awkward and hacky way (see my talks at the previous Beam
>>>>>> Summits). This proposal would bring a good way to work with metadata
>>>>>> on the metadata :-).
>>>>>>
>>>>>> The proposal points to 2 pull requests with the implementation: one
>>>>>> for the API, another for translating proto options to Beam options.
>>>>>>
>>>>>> Thanks to Brian Hulette and Reuven Lax for the initial feedback. All
>>>>>> feedback is welcome.
>>>>>>
>>>>>>  _/
>>>>>> _/ Alex Van Boxel
>>>>>>
>>>>>
