On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
> 2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com>:
>
>> I'm not sure where proto comes from here. Proto is one example of a type
>> that has a schema, but only one example.
>>
>> 1. In the initial prototype I want to avoid modifying the PCollection
>> API. So I think it's best to create a special SchemaCoder, and pass the
>> schema into this coder. Later we might add targeted APIs for this instead
>> of going through a coder.
>> 1.a I don't see what hints have to do with this?
>>
>
> Hints are a way to replace the new API and unify the way to pass metadata
> in beam instead of adding a new custom way each time.
>

I don't think schema is a hint. But I hear what you're saying - a hint is a
type of PCollection metadata as is schema, and we should have a unified API
for setting such metadata.


>
>
>>
>> 2. BeamSQL already has a generic record type which fits this use case
>> very well (though we might modify it). However, as mentioned in the doc,
>> the
>> user is never forced to use this generic record type.
>>
>>
> Well, yes and no. A type already exists, but 1. it is very strictly limited
> (flat columns only, which covers very little of what big data SQL can do)
> and 2. it must be aligned with the convergence on generic data that the
> schema work will bring (really read "aligned" as "dropped in favor of" -
> deprecation being a smooth way to do it).
>

As I said, the existing class needs to be modified and extended, and not
just for this schema use case. It was meant to represent Calcite SQL rows,
but doesn't quite do even that yet (Calcite supports nested rows). However,
I think it's the right basis to start from.
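
For illustration, here is a rough sketch of the kind of extension nested
rows would need - GenericRow and its methods are hypothetical names, not
the existing BeamSQL class:

    import java.io.Serializable;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical generic row where a field value may itself be a row,
    // which a flat, columns-only type cannot express.
    class GenericRow implements Serializable {
      private final List<Object> values;

      private GenericRow(List<Object> values) {
        this.values = values;
      }

      public static GenericRow of(Object... values) {
        return new GenericRow(Arrays.asList(values));
      }

      public Object getValue(int fieldIndex) {
        return values.get(fieldIndex);
      }

      // Nested access: the field at fieldIndex is expected to hold a row.
      public GenericRow getRow(int fieldIndex) {
        return (GenericRow) values.get(fieldIndex);
      }
    }

For example, a person record with a nested address could be built as
GenericRow.of("Alice", GenericRow.of("1 Main St", "Springfield")) and read
back with person.getRow(1).getValue(0).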


>
> So long story short, the main work of this schema track is not only about
> using schemas in runners and other ways but also about starting to make
> beam consistent with itself, which is probably the most important outcome
> since it is the user-facing side of this work.
>
>
>>
>> On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> @Reuven: is the proto only about passing schema or also the generic type?
>>>
>>> There are 2.5 topics to solve this issue:
>>>
>>> 1. How to pass schema
>>> 1.a. hints?
>>> 2. What is the generic record type associated with a schema and how to
>>> express a schema relatively to it
>>>
>>> I would be happy to help on 1.a and 2 somehow if you need.
>>>
>>> On 4 Feb 2018 at 03:30, "Reuven Lax" <re...@google.com> wrote:
>>>
>>>> One more thing. If anyone here has experience with various OSS metadata
>>>> stores (e.g. Kafka Schema Registry is one example), would you like to
>>>> collaborate on implementation? I want to make sure that source schemas can
>>>> be stored in a variety of OSS metadata stores, and be easily pulled into a
>>>> Beam pipeline.
>>>>
>>>> Reuven
>>>>
>>>> On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> If there are no concerns, I would like to start working on a
>>>>> prototype. It's just a prototype, so I don't think it will have the final
>>>>> API (e.g. for the prototype I'm going to avoid changing the API of
>>>>> PCollection, and use a "special" Coder instead). Also even once we go
>>>>> beyond the prototype, it will be @Experimental for some time, so the API
>>>>> will not be set in stone.
>>>>>
>>>>> Any more comments on this approach before we start implementing a
>>>>> prototype?
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <
>>>>> rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> If you need help on the json part, I'm happy to help. To give a few
>>>>>> hints on what is very doable: we can add an avro module to johnzon (the
>>>>>> ASF json{p,b} impl) to back json-p with avro (guess it will be one of
>>>>>> the first to be asked), for instance.
>>>>>>
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>>>
>>>>>> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>>
>>>>>>> Agree. The initial implementation will be a prototype.
>>>>>>>
>>>>>>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <
>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>
>>>>>>>> Hi Reuven,
>>>>>>>>
>>>>>>>> Agree on being able to describe the schema with different formats. The
>>>>>>>> good point about json schemas is that they are described by a spec. My
>>>>>>>> point is also to avoid reinventing the wheel. Just an abstraction able
>>>>>>>> to use Avro, Json, Calcite, or custom schema descriptors would be
>>>>>>>> great.
>>>>>>>>
>>>>>>>> Using a coder to describe a schema sounds like a smart move to
>>>>>>>> implement quickly. However, it has to be clear in terms of
>>>>>>>> documentation to avoid "side effects". I still think
>>>>>>>> PCollection.setSchema() is better: it should be metadata (or a hint
>>>>>>>> ;))) on the PCollection.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>>
>>>>>>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>>>>>>
>>>>>>>>> As to the question of how a schema should be specified, I want to
>>>>>>>>> support several common schema formats. So if a user has a Json
>>>>>>>>> schema, or an Avro schema, or a Calcite schema, etc., there should be
>>>>>>>>> adapters that allow setting a schema from any of them. I don't think
>>>>>>>>> we should prefer one over the other. While Romain is right that many
>>>>>>>>> people know Json, I think far fewer people know Json schemas.
>>>>>>>>>
>>>>>>>>> Agree, schemas should not be enforced (for one thing, that wouldn't
>>>>>>>>> be backwards compatible!). I think for the initial prototype I will
>>>>>>>>> probably use a special coder to represent the schema (with setSchema
>>>>>>>>> an option on the coder), largely because it doesn't require modifying
>>>>>>>>> PCollection. However, I think longer term a schema should be an
>>>>>>>>> optional piece of metadata on the PCollection object. Similar to the
>>>>>>>>> previous discussion about "hints," I think this can be set on the
>>>>>>>>> producing PTransform, and a SetSchema PTransform will allow attaching
>>>>>>>>> a schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))).
>>>>>>>>> This part isn't designed yet, but I think a schema should be similar
>>>>>>>>> to hints: it's just another piece of metadata on the PCollection
>>>>>>>>> (though something interpreted by the model, whereas hints are
>>>>>>>>> interpreted by the runner).
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <
>>>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>     Hi,
>>>>>>>>>
>>>>>>>>>     I think we should avoid mixing two things in the discussion
>>>>>>>>>     (and so the document):
>>>>>>>>>
>>>>>>>>>     1. The element of the collection and the schema itself are two
>>>>>>>>>     different things.
>>>>>>>>>     By essence, Beam should not enforce any schema. That's why I
>>>>>>>>>     think it's a good idea to set the schema optionally on the
>>>>>>>>>     PCollection (pcollection.setSchema()).
>>>>>>>>>
>>>>>>>>>     2. From point 1 come two questions: how do we represent a
>>>>>>>>>     schema? How can we leverage the schema to simplify the
>>>>>>>>>     serialization of the element in the PCollection and querying?
>>>>>>>>>     These two questions are not directly related.
>>>>>>>>>
>>>>>>>>>       2.1 How do we represent the schema
>>>>>>>>>     Json Schema is a very interesting idea. It could be an
>>>>>>>>>     abstraction, and other providers, like Avro, could be bound to
>>>>>>>>>     it. It's part of the json processing spec (javax).
>>>>>>>>>
>>>>>>>>>       2.2. How do we leverage the schema for querying and
>>>>>>>>>     serialization
>>>>>>>>>     Also in the spec, json pointer is interesting for querying.
>>>>>>>>>     Regarding serialization, jackson or another data binder can be
>>>>>>>>>     used.
>>>>>>>>>
>>>>>>>>>     These are still rough ideas in my mind, but I like Romain's idea
>>>>>>>>>     about json-p usage.
>>>>>>>>>
>>>>>>>>>     Once the 2.3.0 release is out, I will start updating the
>>>>>>>>>     document with those ideas, and a PoC.
>>>>>>>>>
>>>>>>>>>     Thanks !
>>>>>>>>>     Regards
>>>>>>>>>     JB
>>>>>>>>>
>>>>>>>>>     On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>>>>>>>     >
>>>>>>>>>     >
>>>>>>>>>     > On 30 Jan 2018 at 01:09, "Reuven Lax" <re...@google.com>
>>>>>>>>>     > wrote:
>>>>>>>>>     >
>>>>>>>>>     >
>>>>>>>>>     >
>>>>>>>>>      >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>>>>>>>>>      >     <rmannibu...@gmail.com> wrote:
>>>>>>>>>      >
>>>>>>>>>      >         Hi
>>>>>>>>>      >
>>>>>>>>>      >         I have some questions on this: how would hierarchical
>>>>>>>>>      >         schemas work? It seems they are not really supported
>>>>>>>>>      >         by the ecosystem (outside of custom stuff) :(. How
>>>>>>>>>      >         would it integrate smoothly with other generic record
>>>>>>>>>      >         types - N bridges?
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      >     Do you mean nested schemas? What do you mean here?
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      > Yes, sorry - wrote the mail too late ;). I meant hierarchical
>>>>>>>>>      > data and nested schemas.
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      >         Concretely, I wonder if using the json API couldn't
>>>>>>>>>      >         be beneficial: json-p is a nice generic abstraction
>>>>>>>>>      >         with a built-in querying mechanism (jsonpointer) but
>>>>>>>>>      >         no actual serialization (even if json and binary json
>>>>>>>>>      >         are very natural). The big advantage is to have a
>>>>>>>>>      >         well-known ecosystem - who doesn't know json today? -
>>>>>>>>>      >         that beam can reuse for free: JsonObject (guess we
>>>>>>>>>      >         don't want the JsonValue abstraction) for the record
>>>>>>>>>      >         type, the jsonschema standard for the schema,
>>>>>>>>>      >         jsonpointer for the selection/projection etc... It
>>>>>>>>>      >         doesn't enforce the actual serialization (json, smile,
>>>>>>>>>      >         avro, ...) but provides an expressive and already
>>>>>>>>>      >         known API, so I see it as a big win-win for users (no
>>>>>>>>>      >         need to learn a new API and use N bridges in all ways)
>>>>>>>>>      >         and beam (implementations exist and the API design is
>>>>>>>>>      >         already thought out).
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      >     I assume you're talking about the API for setting
>>>>>>>>>      >     schemas, not using them. Json has many downsides, and I'm
>>>>>>>>>      >     not sure it's true that everyone knows it; there are also
>>>>>>>>>      >     competing schema APIs, such as Avro etc. However, I think
>>>>>>>>>      >     we should give Json a fair evaluation before dismissing
>>>>>>>>>      >     it.
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      > It is a wider topic than schemas. Actually, schemas are not
>>>>>>>>>      > the first-class citizen here; a generic data representation
>>>>>>>>>      > is. That is where json beats almost any other API. Then, when
>>>>>>>>>      > it comes to schemas, json has a standard for that, so we are
>>>>>>>>>      > all good.
>>>>>>>>>      >
>>>>>>>>>      > Also, json has a good indexing API compared to alternatives,
>>>>>>>>>      > which are sometimes a bit faster - for noop transforms - but
>>>>>>>>>      > are hardly usable or make the code not that readable.
>>>>>>>>>      >
>>>>>>>>>      > Avro is a nice competitor, and it is compatible - actually,
>>>>>>>>>      > avro is json-driven by design - but its API is far from being
>>>>>>>>>      > that easy due to its schema enforcement, which is heavy, and
>>>>>>>>>      > worse, you can't work with avro without a schema. Json would
>>>>>>>>>      > allow reconciling the dynamic and static cases, since the job
>>>>>>>>>      > wouldn't change except for the setSchema.
>>>>>>>>>      >
>>>>>>>>>      > That is why I think json is a good compromise, and having a
>>>>>>>>>      > standard API for it allows fully customizing the impl at will
>>>>>>>>>      > if needed - even using avro or protobuf.
>>>>>>>>>      >
>>>>>>>>>      > Side note on the beam api: I don't think it is good to use a
>>>>>>>>>      > main API for runner optimization. It enforces something to be
>>>>>>>>>      > shared on all runners but not widely usable. It is also
>>>>>>>>>      > misleading for users. Would you set a flink pipeline option
>>>>>>>>>      > with dataflow? My proposal here is to use hints - properties -
>>>>>>>>>      > instead of something hardly defined in the API, then
>>>>>>>>>      > standardize it if all runners support it.
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      >         Wdyt?
>>>>>>>>>      >
>>>>>>>>>      >         On 29 Jan 2018 at 06:24, "Jean-Baptiste Onofré"
>>>>>>>>>      >         <j...@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>      >
>>>>>>>>>      >             Hi Reuven,
>>>>>>>>>      >
>>>>>>>>>      >             Thanks for the update! As I'm working with you on
>>>>>>>>>      >             this, I fully agree - great doc gathering the
>>>>>>>>>      >             ideas.
>>>>>>>>>      >
>>>>>>>>>      >             It's clearly something we have to add asap in
>>>>>>>>>      >             Beam, because it would allow new use cases for our
>>>>>>>>>      >             users (in a simple way) and open new areas for the
>>>>>>>>>      >             runners (for instance, dataframe support in the
>>>>>>>>>      >             Spark runner).
>>>>>>>>>      >
>>>>>>>>>      >             By the way, a while ago, I created BEAM-3437 to
>>>>>>>>>      >             track the PoC/PR around this.
>>>>>>>>>      >
>>>>>>>>>      >             Thanks !
>>>>>>>>>      >
>>>>>>>>>      >             Regards
>>>>>>>>>      >             JB
>>>>>>>>>      >
>>>>>>>>>      >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>>>>>>      >             > Previously I submitted a proposal for adding
>>>>>>>>>      >             > schemas as a first-class concept on Beam
>>>>>>>>>      >             > PCollections. The proposal engendered quite a
>>>>>>>>>      >             > bit of discussion from the community - more
>>>>>>>>>      >             > discussion than I've seen from almost any of our
>>>>>>>>>      >             > proposals to date!
>>>>>>>>>      >             >
>>>>>>>>>      >             > Based on the feedback and comments, I reworked
>>>>>>>>>      >             > the proposal document quite a bit. It now talks
>>>>>>>>>      >             > more explicitly about the difference between
>>>>>>>>>      >             > dynamic schemas (where the schema is not fully
>>>>>>>>>      >             > known at graph-creation time) and static schemas
>>>>>>>>>      >             > (which are fully known at graph-creation time).
>>>>>>>>>      >             > Proposed APIs are more fleshed out now (again,
>>>>>>>>>      >             > thanks to feedback from community members), and
>>>>>>>>>      >             > the document talks in more detail about evolving
>>>>>>>>>      >             > schemas in long-running streaming pipelines.
>>>>>>>>>      >             >
>>>>>>>>>      >             > Please take a look. I think this will be very
>>>>>>>>>      >             > valuable to Beam, and I welcome any feedback.
>>>>>>>>>      >             >
>>>>>>>>>      >             >
>>>>>>>>>      >
>>>>>>>>>      >             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>>>>>>      >             >
>>>>>>>>>      >             > Reuven
>>>>>>>>>      >
>>>>>>>>>      >             --
>>>>>>>>>      >             Jean-Baptiste Onofré
>>>>>>>>>      >             jbono...@apache.org
>>>>>>>>>      > http://blog.nanthrax.net
>>>>>>>>>      >             Talend - http://www.talend.com
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>      >
>>>>>>>>>
>>>>>>>>>     --
>>>>>>>>>     Jean-Baptiste Onofré
>>>>>>>>>     jbono...@apache.org
>>>>>>>>>     http://blog.nanthrax.net
>>>>>>>>>     Talend - http://www.talend.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
