Re: [PROPOSAL] Beam Schema Options

Alex Van Boxel Sat, 08 Feb 2020 00:25:55 -0800

I'll be happy to make a BIP out of it.

Kenneth, about your concern: I first crammed all our proto option data into
the metadata, and that worked but was very ugly: you had to parse the
binary data in the transform. After switching to Beam options the code
became more readable, and it made more sense to the team.
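
To illustrate the difference, here is a minimal Python sketch. It is not our
actual code and not the Beam API: the Field class, the "sensitive" option and
the encoded blob are all made up for the example, and JSON only stands in for
the serialized proto options.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Field:
    """Hypothetical stand-in for a schema field (not Beam's Schema.Field)."""
    name: str
    type_name: str
    metadata: bytes = b""  # old approach: one opaque blob per field
    options: Dict[str, Any] = field(default_factory=dict)  # typed values


def is_sensitive_from_metadata(f: Field) -> bool:
    # Old approach: every transform must know the encoding and decode the blob.
    if not f.metadata:
        return False
    decoded = json.loads(f.metadata.decode("utf-8"))
    return bool(decoded.get("sensitive", False))


def is_sensitive_from_option(f: Field) -> bool:
    # Options approach: the value is already typed when the transform sees it.
    return bool(f.options.get("sensitive", False))


old_style = Field(
    "email", "STRING",
    metadata=json.dumps({"sensitive": True}).encode("utf-8"),
)
new_style = Field("email", "STRING", options={"sensitive": True})
```

Both helpers answer the same question, but only the first one has to carry
decoding logic into the transform.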


On Sat, Feb 8, 2020, 04:33 Kenneth Knowles <k...@apache.org> wrote:

> All fair points. I think it is a good proposal. We already know of
> existing and future uses for it.
>
> I don't think my concerns are actually answered by this discussion. Does
> this allow/encourage creation of a PCollection that you can't make sense of
> (or can't make *good* sense of) without understanding the options? We don't
> have to answer that now and maybe it is unanswerable. If we look at proto
> options as an example it seems to be mostly OK.
>
> I think the risk is inseparable from how powerful it could be. So it is
> worth accepting. If my fears come to pass, it will still mean that Beam is
> being useful in new and unexpected ways, so that's not so bad :-)
>
> Would it make sense to write this up as a BIP, to help bootstrap the wiki
> page for them?
>
> Kenn
>
> On Fri, Feb 7, 2020 at 2:05 PM Reuven Lax <re...@google.com> wrote:
>
>> True - however at some level that's up to the user. We should be diligent
>> that we don't implement core functionality this way (so far schema metadata
>> has only been used for the fidelity use case above). However, if some users
>> want to use it more extensively in their pipeline, that's up to them.
>>
>> Reuven
>>
>> On Fri, Feb 7, 2020 at 2:02 PM Kenneth Knowles <k...@apache.org> wrote:
>>
>>> It is a good point that it applies to configuring sources and sinks
>>> mostly, or external data more generally.
>>>
>>> What I worry about is that metadata channels like this tend to take over
>>> everything, and do it worse than a more structured approach.
>>>
>>> As an exaggerated example which is not actually that far-fetched,
>>> schemas could have a single type: bytes. And then the options on the fields
>>> would say how to encode/decode the bytes. Now you've created a system where
>>> the options are the main event and are actually core to the system. My
>>> expectation is that this will almost certainly happen unless we are
>>> vigilant about preventing core features from landing as options.
>>>
>>> Kenn
>>>
>>> On Fri, Feb 7, 2020 at 1:56 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> I disagree - I've had several cases where user options on fields are
>>>> very useful internally.
>>>>
>>>> A common rationale is to preserve fidelity. For instance, reading
>>>> protos, projecting out a few fields, writing protos back out. You want to
>>>> be able to nicely map protos to Beam schemas, but also preserve all the
>>>> extra metadata on proto fields. This metadata has to carry through on all
>>>> intermediate PCollections and schemas, so it makes sense to put it on the
>>>> field.
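
A minimal Python sketch of the fidelity case described here. The Field and
Schema classes are hypothetical (not Beam's), and the option names
"proto_tag" and "deprecated" are made up: a projection drops fields but
copies the options of the fields it keeps, so a later proto writer could
still see them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Field:
    """Hypothetical schema field carrying per-field options."""
    name: str
    type_name: str
    options: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Schema:
    """Hypothetical schema: an ordered list of fields."""
    fields: List[Field]


def project(schema: Schema, keep: List[str]) -> Schema:
    """Keep only the named fields, copying each field's options unchanged."""
    by_name = {f.name: f for f in schema.fields}
    return Schema([
        Field(by_name[n].name, by_name[n].type_name, dict(by_name[n].options))
        for n in keep
    ])


source = Schema([
    Field("id", "INT64", {"proto_tag": 1}),
    Field("email", "STRING", {"proto_tag": 2, "deprecated": True}),
    Field("notes", "STRING", {"proto_tag": 3}),
])
projected = project(source, ["id", "email"])
```
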
>>>>
>>>> Another example: it provides a nice extension point for annotation
>>>> extensions to a field. These are often things like Optional or Taint
>>>> markings that don't change the semantic interpretation of the type (so
>>>> shouldn't be a logical type), but do provide extra information about the
>>>> field.
>>>>
>>>> Reuven
>>>>
>>>> On Fri, Feb 7, 2020 at 8:54 AM Brian Hulette <bhule...@google.com>
>>>> wrote:
>>>>
>>>>> I'm not sure this belongs directly on schemas. I've had trouble
>>>>> reconciling that opinion, since the idea does seem very useful, and in
>>>>> fact I'm interested in using it myself. I think I've figured out my
>>>>> concern - what I really want is options for a (maybe portable) Table.
>>>>>
>>>>> As I indicated in a comment in the doc [1], I still think all of the
>>>>> examples you've described only apply to IOs. To be clear, what I mean is
>>>>> that all of the examples either
>>>>> 1) modify the behavior of the external system the IO is interfacing
>>>>> with (specify partitioning, indexing, etc.), or
>>>>> 2) define some transformation that should be done to the data adjacent
>>>>> to the IO (after an Input or before an Output) in Beam.
>>>>>
>>>>> (1) is the sort of thing you described in the IO section [2] (aside
>>>>> from the PubSub example I added, since that's describing a transformation
>>>>> to do in Beam).
>>>>> I would argue that all of the other examples fall under (2): data
>>>>> validation, adding computed columns, encryption, etc. are things that
>>>>> can be done in a transform.
>>>>>
>>>>> I think we can make an analogy to a traditional database here:
>>>>> schema-aware Beam IOs are like Tables in a database, other PCollections
>>>>> are like intermediate results in a query. In a database, Tables can be
>>>>> defined with some DDL and have schema-level or column-level options that
>>>>> change system behavior, but intermediate results have no such capability.
>>>>>
>>>>>
>>>>> Another point I think is worth discussing: is there value in making
>>>>> these options portable?
>>>>> As it's currently defined I'm not sure there is - everything could be
>>>>> done within a single SDK. However, portable options on a portable table
>>>>> could be very powerful, since they could be used to configure cross-language
>>>>> IOs, perhaps with something like
>>>>> https://s.apache.org/xlang-table-provider/
>>>>>
>>>>> [1]
>>>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?disco=AAAAI54si4k
>>>>> [2]
>>>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit#heading=h.8sjt9ax55hmt
>>>>>
>>>>> On Wed, Feb 5, 2020, 4:17 AM Alex Van Boxel <a...@vanboxel.be> wrote:
>>>>>
>>>>>> I would appreciate it if someone would look at the following PR and get
>>>>>> it to master:
>>>>>>
>>>>>> https://github.com/apache/beam/pull/10413#
>>>>>>
>>>>>> A lot of work needs to follow, but if we have the base already on
>>>>>> master the next layers can follow. As a reminder, this is the base
>>>>>> proposal:
>>>>>>
>>>>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>>>>>
>>>>>> I've also looked for prior work, and saw that Spark actually has
>>>>>> something comparable:
>>>>>>
>>>>>> https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Row.html
>>>>>>
>>>>>> but when the options are finished they will be far more powerful, as
>>>>>> they are not limited to fields.
>>>>>>
>>>>>>  _/
>>>>>> _/ Alex Van Boxel
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 29, 2020 at 4:55 AM Kenneth Knowles <k...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Using schema types for the metadata values is a nice touch.
>>>>>>>
>>>>>>> Are the options expected to be common across many fields? Perhaps
>>>>>>> the name should be a URN to make it clear to be careful about
>>>>>>> collisions? (just a synonym for "name" in practice, but with a
>>>>>>> different connotation)
>>>>>>>
>>>>>>> I generally like this... but the examples (all but one) are weird
>>>>>>> things where I don't really understand how they are done or who is
>>>>>>> responsible for them.
>>>>>>>
>>>>>>> One way to go is this: if options are maybe not understood by all
>>>>>>> consumers, then they can't really change behavior. Kind of like how the
>>>>>>> URN and payload on a composite transform can be ignored and just the
>>>>>>> expansion used.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Sun, Jan 26, 2020 at 8:27 AM Alex Van Boxel <a...@vanboxel.be>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I'm proud to announce my first real proposal. The proposal
>>>>>>>> describes Beam Schema Options. This is an extension to the Schema API
>>>>>>>> to add typed metadata to Rows, Fields and Logical Types:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>>>>>>>
>>>>>>>> To give you some context on where this proposal comes from: we've been
>>>>>>>> using dynamic, metadata-driven pipelines for a while, but until now in an
>>>>>>>> awkward and hacky way (see my talks at the previous Beam Summits). This
>>>>>>>> proposal would bring a good way to work with metadata on the metadata
>>>>>>>> :-).
>>>>>>>>
>>>>>>>> The proposal points to 2 pull requests with the implementation: one
>>>>>>>> for the API, another for translating proto options to Beam options.
>>>>>>>>
>>>>>>>> Thanks to Brian Hulette and Reuven Lax for the initial feedback.
>>>>>>>> All feedback is welcome.
>>>>>>>>
>>>>>>>>  _/
>>>>>>>> _/ Alex Van Boxel
>>>>>>>>
>>>>>>>
