2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com>:

>
>
> On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>>
>> 2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> I'm not sure where proto comes from here. Proto is one example of a type
>>> that has a schema, but only one example.
>>>
>>> 1. In the initial prototype I want to avoid modifying the PCollection
>>> API. So I think it's best to create a special SchemaCoder, and pass the
>>> schema into this coder. Later we might add targeted APIs for this instead
>>> of going through a coder.
>>> 1.a I don't see what hints have to do with this?
>>>
>>
>> Hints are a way to avoid adding a new API and to unify the way metadata is
>> passed in Beam, instead of adding a new custom mechanism each time.
>>
>
> I don't think schema is a hint. But I hear what you're saying - hint is a
> type of PCollection metadata, as is schema, and we should have a unified API
> for setting such metadata.
>

:), Ismael pointed out to me earlier this week that "hint" had an older
meaning in Beam. My usage is purely the one found in most EE specs (your
"metadata" in the previous answer). But I guess we are aligned on the meaning
now, just wanted to be sure.


>
>
>>
>>
>>>
>>> 2. BeamSQL already has a generic record type which fits this use case
>>> very well (though we might modify it). However as mentioned in the doc, the
>>> user is never forced to use this generic record type.
>>>
>>>
>> Well, yes and no. A type already exists, but 1. it is very strictly
>> limited (flat columns only, which covers very little of what big data SQL
>> can do) and 2. it must be aligned with the convergence on generic data
>> that the schema work will bring (really read "aligned" as "dropped in
>> favor of" - deprecation being a smooth way to do it).
>>
>
> As I said, the existing class needs to be modified and extended, and not
> just for this schema use case. It was meant to represent Calcite SQL rows,
> but doesn't quite do even that yet (Calcite supports nested rows). However
> I think it's the right basis to start from.
>

Agree on the state. The current impl issues I hit (in addition to the nested
support, which would by itself require a kind of visitor solution) are the
fact that each record owns its schema, and that serialization is handled
field by field instead of as a whole, which is how it would be handled with a
schema IMHO.
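To make the "serialize as a whole against an external schema" point concrete, here is a minimal, self-contained sketch. The names (`Schema`, `Row`, `encode`/`decode`) are hypothetical, not Beam or BeamSQL API; it only illustrates the split where the schema lives with the coder and each record carries values only:

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch (not Beam API): the schema is owned by the coder side,
// so a Row carries only values and is (de)serialized as a whole against the
// shared field order, rather than each record embedding its own schema.
public class SchemaCoderSketch {

    /** Ordered field names; stands in for a real schema with field types. */
    static final class Schema {
        final List<String> fields;
        Schema(String... fields) { this.fields = List.of(fields); }
    }

    /** A generic record: just values, positionally aligned with the schema. */
    static final class Row {
        final List<String> values;
        Row(String... values) { this.values = List.of(values); }
    }

    /** Encodes a Row using the schema's field order; no per-field metadata. */
    static byte[] encode(Schema schema, Row row) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (int i = 0; i < schema.fields.size(); i++) {
            out.writeUTF(row.values.get(i)); // values only; names live in the schema
        }
        return bos.toByteArray();
    }

    static Row decode(Schema schema, byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        String[] values = new String[schema.fields.size()];
        for (int i = 0; i < values.length; i++) values[i] = in.readUTF();
        return new Row(values);
    }

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema("user", "country");
        Row row = new Row("romain", "FR");
        Row copy = decode(schema, encode(schema, row));
        System.out.println(copy.values); // [romain, FR]
    }
}
```

With this split, the per-record payload stays values-only, and the schema can be stored, evolved, and reasoned about independently of the records.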

Concretely, what I don't want is to do a PoC which works - they all work,
right? - and integrate it into Beam without thinking about a global solution
for this generic record issue and its schema standardization. This is where
JSON(-P) has a lot of value IMHO, but it requires a bit more love than just
adding a schema to the model.
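As an illustration of the JSON-P style of generic record access being referred to here, a stdlib-only sketch. The class below is a hypothetical stand-in, not javax.json's JsonObject/JsonPointer, and it ignores RFC 6901 escaping (`~0`/`~1`):

```java
import java.util.*;

// Hypothetical stand-in (not javax.json): a nested generic record modeled as
// Map-of-Maps plus a "/a/b"-style pointer lookup, to show the querying model
// that JSON Pointer standardizes over JsonObject.
public class JsonPointerSketch {

    /** Resolves an RFC 6901-style pointer ("/user/name") against nested maps.
     *  Escaping and array indices are omitted for brevity. */
    static Object pointer(Map<String, Object> record, String path) {
        Object current = record;
        for (String token : path.substring(1).split("/")) {
            current = ((Map<?, ?>) current).get(token);
        }
        return current;
    }

    public static void main(String[] args) {
        // Nested generic record: the hierarchic shape JsonObject generalizes.
        Map<String, Object> record = Map.of(
            "user", Map.of("name", "romain", "country", "FR"));
        System.out.println(pointer(record, "/user/name")); // romain
    }
}
```

The real JSON-P API adds typed accessors, builders, and a standard serialization story on top of this querying model, which is what makes it attractive as a shared generic record abstraction.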


>
>
>>
>> So, long story short: the main work of this schema track is not only about
>> using schemas in runners and elsewhere, but also about starting to make
>> Beam consistent with itself, which is probably the most important outcome
>> since it is the user-facing side of this work.
>>
>>
>>>
>>> On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> @Reuven: is the proto only about passing schema or also the generic
>>>> type?
>>>>
>>>> There are 2.5 topics to solve this issue:
>>>>
>>>> 1. How to pass schema
>>>> 1.a. hints?
>>>> 2. What is the generic record type associated with a schema, and how to
>>>> express a schema relative to it
>>>>
>>>> I would be happy to help on 1.a and 2 somehow if you need.
>>>>
>>>> On 4 Feb 2018 at 03:30, "Reuven Lax" <re...@google.com> wrote:
>>>>
>>>>> One more thing. If anyone here has experience with various OSS
>>>>> metadata stores (e.g. Kafka Schema Registry is one example), would you 
>>>>> like
>>>>> to collaborate on implementation? I want to make sure that source schemas
>>>>> can be stored in a variety of OSS metadata stores, and be easily pulled
>>>>> into a Beam pipeline.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> If there are no concerns, I would like to start working on a
>>>>>> prototype. It's just a prototype, so I don't think it will have the final
>>>>>> API (e.g. for the prototype I'm going to avoid changing the API of
>>>>>> PCollection, and use a "special" Coder instead). Also, even once we go
>>>>>> beyond the prototype, it will be @Experimental for some time, so the API
>>>>>> will not be set in stone.
>>>>>>
>>>>>> Any more comments on this approach before we start implementing a
>>>>>> prototype?
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <
>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>>> If you need help on the JSON part, I'm happy to help. To give a few
>>>>>>> hints on what is very doable: for instance, we can add an Avro module
>>>>>>> to Johnzon (the ASF JSON{P,B} impl) to back JSON-P with Avro (I guess
>>>>>>> it will be one of the first things asked for).
>>>>>>>
>>>>>>>
>>>>>>> Romain Manni-Bucau
>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>>>
>>>>>>> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>>>
>>>>>>>> Agree. The initial implementation will be a prototype.
>>>>>>>>
>>>>>>>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <
>>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>>
>>>>>>>>> Hi Reuven,
>>>>>>>>>
>>>>>>>>> Agreed on being able to describe the schema with different formats.
>>>>>>>>> The good point about JSON schemas is that they are described by a
>>>>>>>>> spec. My point is also to avoid reinventing the wheel. An abstraction
>>>>>>>>> able to use Avro, JSON, Calcite, or custom schema descriptors would
>>>>>>>>> be great.
>>>>>>>>>
>>>>>>>>> Using a coder to describe a schema sounds like a smart move to
>>>>>>>>> implement quickly. However, it has to be clearly documented to avoid
>>>>>>>>> "side effects". I still think PCollection.setSchema() is better: it
>>>>>>>>> should be metadata (or a hint ;))) on the PCollection.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>>>>>>>
>>>>>>>>>> As to the question of how a schema should be specified, I want to
>>>>>>>>>> support several common schema formats. So if a user has a Json 
>>>>>>>>>> schema, or
>>>>>>>>>> an Avro schema, or a Calcite schema, etc., there should be adapters
>>>>>>>>>> that
>>>>>>>>>> allow setting a schema from any of them. I don't think we should 
>>>>>>>>>> prefer one
>>>>>>>>>> over the other. While Romain is right that many people know Json, I 
>>>>>>>>>> think
>>>>>>>>>> far fewer people know Json schemas.
>>>>>>>>>>
>>>>>>>>>> Agree, schemas should not be enforced (for one thing, that
>>>>>>>>>> wouldn't be backwards compatible!). I think for the initial 
>>>>>>>>>> prototype I
>>>>>>>>>> will probably use a special coder to represent the schema (with 
>>>>>>>>>> setSchema
>>>>>>>>>> an option on the coder), largely because it doesn't require modifying
>>>>>>>>>> PCollection. However I think longer term a schema should be an 
>>>>>>>>>> optional
>>>>>>>>>> piece of metadata on the PCollection object. Similar to the previous
>>>>>>>>>> discussion about "hints," I think this can be set on the producing
>>>>>>>>>> PTransform, and a SetSchema PTransform will allow attaching a schema 
>>>>>>>>>> to any
>>>>>>>>>> PCollection (i.e. pc.apply(SetSchema.of(schema))). This part
>>>>>>>>>> isn't designed yet, but I think schema should be similar to hints, 
>>>>>>>>>> it's
>>>>>>>>>> just another piece of metadata on the PCollection (though something
>>>>>>>>>> interpreted by the model, whereas hints are interpreted by the runner).
>>>>>>>>>>
>>>>>>>>>> Reuven
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré
>>>>>>>>>> <j...@nanthrax.net> wrote:
>>>>>>>>>>
>>>>>>>>>>     Hi,
>>>>>>>>>>
>>>>>>>>>>     I think we should avoid mixing two things in the discussion
>>>>>>>>>>     (and so the document):
>>>>>>>>>>
>>>>>>>>>>     1. The element of the collection and the schema itself are two
>>>>>>>>>>     different things. In essence, Beam should not enforce any
>>>>>>>>>>     schema. That's why I think it's a good idea to set the schema
>>>>>>>>>>     optionally on the PCollection (pcollection.setSchema()).
>>>>>>>>>>
>>>>>>>>>>     2. From point 1 come two questions: how do we represent a
>>>>>>>>>>     schema? How can we leverage the schema to simplify the
>>>>>>>>>>     serialization of the element in the PCollection and querying?
>>>>>>>>>>     These two questions are not directly related.
>>>>>>>>>>
>>>>>>>>>>       2.1 How do we represent the schema
>>>>>>>>>>     JSON Schema is a very interesting idea. It could be an
>>>>>>>>>>     abstraction, and other providers, like Avro, could be bound to
>>>>>>>>>>     it. It's part of the JSON Processing spec (javax).
>>>>>>>>>>
>>>>>>>>>>       2.2 How do we leverage the schema for querying and
>>>>>>>>>>     serialization
>>>>>>>>>>     Also in the spec, JSON Pointer is interesting for querying.
>>>>>>>>>>     Regarding serialization, Jackson or another data binder can be
>>>>>>>>>>     used.
>>>>>>>>>>
>>>>>>>>>>     These are still rough ideas in my mind, but I like Romain's
>>>>>>>>>>     idea about JSON-P usage.
>>>>>>>>>>
>>>>>>>>>>     Once the 2.3.0 release is out, I will start to update the
>>>>>>>>>>     document with those ideas, and a PoC.
>>>>>>>>>>
>>>>>>>>>>     Thanks !
>>>>>>>>>>     Regards
>>>>>>>>>>     JB
>>>>>>>>>>
>>>>>>>>>>     On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>>>>>>>>     >
>>>>>>>>>>     >
>>>>>>>>>>     > On 30 Jan 2018 at 01:09, "Reuven Lax" <re...@google.com>
>>>>>>>>>>     > wrote:
>>>>>>>>>>     >
>>>>>>>>>>     >
>>>>>>>>>>     >
>>>>>>>>>>      >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>>>>>>>>>>      >     <rmannibu...@gmail.com> wrote:
>>>>>>>>>>      >
>>>>>>>>>>      >         Hi
>>>>>>>>>>      >
>>>>>>>>>>      >         I have some questions on this: how would hierarchic
>>>>>>>>>>      >         schemas work? It seems that is not really supported
>>>>>>>>>>      >         by the ecosystem (outside of custom stuff) :(. How
>>>>>>>>>>      >         would it integrate smoothly with other generic
>>>>>>>>>>      >         record types - N bridges?
>>>>>>>>>>      >
>>>>>>>>>>      >
>>>>>>>>>>      >     Do you mean nested schemas? What do you mean here?
>>>>>>>>>>      >
>>>>>>>>>>      >
>>>>>>>>>>      > Yes, sorry - I wrote the mail too late ;). I meant
>>>>>>>>>>      > hierarchic data and nested schemas.
>>>>>>>>>>      >
>>>>>>>>>>      >
>>>>>>>>>>      >         Concretely, I wonder if using the JSON API couldn't
>>>>>>>>>>      >         be beneficial: JSON-P is a nice generic abstraction
>>>>>>>>>>      >         with a built-in querying mechanism (JSON Pointer)
>>>>>>>>>>      >         but no actual serialization (even if JSON and
>>>>>>>>>>      >         binary JSON are very natural). The big advantage is
>>>>>>>>>>      >         to have a well-known ecosystem - who doesn't know
>>>>>>>>>>      >         JSON today? - that Beam can reuse for free:
>>>>>>>>>>      >         JsonObject (I guess we don't want the JsonValue
>>>>>>>>>>      >         abstraction) for the record type, the JSON Schema
>>>>>>>>>>      >         standard for the schema, JSON Pointer for the
>>>>>>>>>>      >         selection/projection, etc. It doesn't enforce the
>>>>>>>>>>      >         actual serialization (JSON, Smile, Avro, ...) but
>>>>>>>>>>      >         provides an expressive and already-known API, so I
>>>>>>>>>>      >         see it as a big win-win for users (no need to learn
>>>>>>>>>>      >         a new API and use N bridges in all directions) and
>>>>>>>>>>      >         for Beam (impls are here and the API design is
>>>>>>>>>>      >         already thought out).
>>>>>>>>>>      >
>>>>>>>>>>      >
>>>>>>>>>>      >     I assume you're talking about the API for setting
>>>>>>>>>>      >     schemas, not using them. JSON has many downsides, and
>>>>>>>>>>      >     I'm not sure it's true that everyone knows it; there
>>>>>>>>>>      >     are also competing schema APIs, such as Avro, etc.
>>>>>>>>>>      >     However, I think we should give JSON a fair evaluation
>>>>>>>>>>      >     before dismissing it.
>>>>>>>>>>      >
>>>>>>>>>>      >
>>>>>>>>>>      > It is a wider topic than schemas. Actually, schemas are
>>>>>>>>>>      > not the first-class citizen here; a generic data
>>>>>>>>>>      > representation is. That is where JSON beats almost any
>>>>>>>>>>      > other API. Then, when it comes to schemas, JSON has a
>>>>>>>>>>      > standard for that, so we are all good.
>>>>>>>>>>      >
>>>>>>>>>>      > Also, JSON has a good indexing API compared to
>>>>>>>>>>      > alternatives, which are sometimes a bit faster - for no-op
>>>>>>>>>>      > transforms - but are hardly usable or make the code not
>>>>>>>>>>      > that readable.
>>>>>>>>>>      >
>>>>>>>>>>      > Avro is a nice competitor, but it is compatible -
>>>>>>>>>>      > actually, Avro is JSON-driven by design - but its API is
>>>>>>>>>>      > far from being that easy due to its schema enforcement,
>>>>>>>>>>      > which is heavy, and worse, you can't work with Avro
>>>>>>>>>>      > without a schema. JSON would allow reconciling the dynamic
>>>>>>>>>>      > and static cases, since the job wouldn't change except for
>>>>>>>>>>      > the setSchema.
>>>>>>>>>>      >
>>>>>>>>>>      > That is why I think JSON is a good compromise: having a
>>>>>>>>>>      > standard API for it allows fully customizing the impl at
>>>>>>>>>>      > will if needed - even using Avro or protobuf.
>>>>>>>>>>      >
>>>>>>>>>>      > Side note on the Beam API: I don't think it is good to
>>>>>>>>>>      > use the main API for runner optimization. It forces
>>>>>>>>>>      > something to be shared across all runners while not being
>>>>>>>>>>      > widely usable. It is also misleading for users. Would you
>>>>>>>>>>      > set a Flink pipeline option with Dataflow? My proposal
>>>>>>>>>>      > here is to use hints - properties - instead of something
>>>>>>>>>>      > hard-wired in the API, and then standardize it if all
>>>>>>>>>>      > runners support it.
>>>>>>>>>>      >
>>>>>>>>>>      >
>>>>>>>>>>      >
>>>>>>>>>>      >         Wdyt?
>>>>>>>>>>      >
>>>>>>>>>>      >         On 29 Jan 2018 at 06:24, "Jean-Baptiste Onofré"
>>>>>>>>>>      >         <j...@nanthrax.net> wrote:
>>>>>>>>>>
>>>>>>>>>>      >
>>>>>>>>>>      >             Hi Reuven,
>>>>>>>>>>      >
>>>>>>>>>>      >             Thanks for the update! As I'm working with you
>>>>>>>>>>      >             on this, I fully agree - it's a great doc
>>>>>>>>>>      >             gathering the ideas.
>>>>>>>>>>      >
>>>>>>>>>>      >             It's clearly something we have to add ASAP in
>>>>>>>>>>      >             Beam, because it would enable new use cases for
>>>>>>>>>>      >             our users (in a simple way) and open new areas
>>>>>>>>>>      >             for the runners (for instance, DataFrame
>>>>>>>>>>      >             support in the Spark runner).
>>>>>>>>>>      >
>>>>>>>>>>      >             By the way, a while ago I created BEAM-3437 to
>>>>>>>>>>      >             track the PoC/PR around this.
>>>>>>>>>>      >
>>>>>>>>>>      >             Thanks !
>>>>>>>>>>      >
>>>>>>>>>>      >             Regards
>>>>>>>>>>      >             JB
>>>>>>>>>>      >
>>>>>>>>>>      >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>>>>>>>      >             > Previously I submitted a proposal for adding
>>>>>>>>>>     schemas as a
>>>>>>>>>>      >             first-class concept on
>>>>>>>>>>      >             > Beam PCollections. The proposal engendered
>>>>>>>>>> quite a
>>>>>>>>>>     bit of
>>>>>>>>>>      >             discussion from the
>>>>>>>>>>      >             > community - more discussion than I've seen
>>>>>>>>>> from
>>>>>>>>>>     almost any of our
>>>>>>>>>>      >             proposals to
>>>>>>>>>>      >             > date!
>>>>>>>>>>      >             >
>>>>>>>>>>      >             > Based on the feedback and comments, I
>>>>>>>>>>      >             > reworked the proposal document quite a bit.
>>>>>>>>>>      >             > It now talks more explicitly about the
>>>>>>>>>>      >             > difference between dynamic schemas (where the
>>>>>>>>>>      >             > schema is not fully known at graph-creation
>>>>>>>>>>      >             > time) and static schemas (which are fully
>>>>>>>>>>      >             > known at graph-creation time). Proposed APIs
>>>>>>>>>>      >             > are more fleshed out now (again thanks to
>>>>>>>>>>      >             > feedback from community members), and the
>>>>>>>>>>      >             > document talks in more detail about evolving
>>>>>>>>>>      >             > schemas in long-running streaming pipelines.
>>>>>>>>>>      >             >
>>>>>>>>>>      >             > Please take a look. I think this will be very
>>>>>>>>>>     valuable to Beam,
>>>>>>>>>>      >             and welcome any
>>>>>>>>>>      >             > feedback.
>>>>>>>>>>      >             >
>>>>>>>>>>      >             >
>>>>>>>>>>      >             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>>>>>>>      >             >
>>>>>>>>>>      >             > Reuven
>>>>>>>>>>      >
>>>>>>>>>>      >             --
>>>>>>>>>>      >             Jean-Baptiste Onofré
>>>>>>>>>>      >             jbono...@apache.org
>>>>>>>>>>      > http://blog.nanthrax.net
>>>>>>>>>>      >             Talend - http://www.talend.com
>>>>>>>>>>      >
>>>>>>>>>>      >
>>>>>>>>>>      >
>>>>>>>>>>
>>>>>>>>>>     --
>>>>>>>>>>     Jean-Baptiste Onofré
>>>>>>>>>>     jbono...@apache.org
>>>>>>>>>>     http://blog.nanthrax.net
>>>>>>>>>>     Talend - http://www.talend.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>
