I'm not sure where proto comes from here. Proto is one example of a type
that has a schema, but only one example.

1. In the initial prototype I want to avoid modifying the PCollection API.
So I think it's best to create a special SchemaCoder, and pass the schema
into this coder. Later we might add targeted APIs for this instead of going
through a coder.
1.a I don't see what hints have to do with this?

2. BeamSQL already has a generic record type which fits this use case very
well (though we might modify it). However, as mentioned in the doc, the user
is never forced to use this generic record type.
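To make the coder idea concrete, here is a standalone sketch of how a schema could travel inside a coder without touching the PCollection API. Schema and SchemaCoder below are simplified stand-ins invented for illustration, not the actual Beam API:

```java
import java.util.*;

// Hypothetical, simplified sketch: a schema is just an ordered list of field
// names, and the coder that carries it encodes rows in schema order so that
// any consumer holding the same coder can decode positionally.
final class Schema {
    final List<String> fieldNames;
    Schema(List<String> fieldNames) { this.fieldNames = List.copyOf(fieldNames); }
}

final class SchemaCoder {
    private final Schema schema;
    SchemaCoder(Schema schema) { this.schema = schema; }

    // The PCollection's schema rides along on the coder.
    Schema getSchema() { return schema; }

    // Encode a row (field name -> value) positionally, using the schema.
    List<String> encode(Map<String, String> row) {
        List<String> out = new ArrayList<>();
        for (String field : schema.fieldNames) {
            out.add(row.get(field));
        }
        return out;
    }

    // Decode positional values back into a named row.
    Map<String, String> decode(List<String> values) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < schema.fieldNames.size(); i++) {
            row.put(schema.fieldNames.get(i), values.get(i));
        }
        return row;
    }
}
```

The point of the sketch: once the schema lives on the coder, anything downstream that can see the coder can also see the schema, which is exactly why this works as a prototype without modifying PCollection.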

On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> @Reuven: is the proto only about passing schema or also the generic type?
>
> There are 2.5 topics to solve this issue:
>
> 1. How to pass schema
> 1.a. hints?
> 2. What is the generic record type associated with a schema and how to
> express a schema relatively to it
>
> I would be happy to help on 1.a and 2 somehow if you need.
>
> On Feb 4, 2018 at 3:30 AM, "Reuven Lax" <re...@google.com> wrote:
>
>> One more thing. If anyone here has experience with various OSS metadata
>> stores (e.g. Kafka Schema Registry is one example), would you like to
>> collaborate on implementation? I want to make sure that source schemas can
>> be stored in a variety of OSS metadata stores, and be easily pulled into a
>> Beam pipeline.
>>
>> Reuven
>>
>> On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
>>
>>> Hi all,
>>>
>>> If there are no concerns, I would like to start working on a prototype.
>>> It's just a prototype, so I don't think it will have the final API (e.g.
>>> for the prototype I'm going to avoid changing the API of PCollection, and
>>> use a "special" Coder instead). Also, even once we go beyond the prototype,
>>> it will be @Experimental for some time, so the API will not be set in stone.
>>>
>>> Any more comments on this approach before we start implementing a
>>> prototype?
>>>
>>> Reuven
>>>
>>> On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> If you need help on the JSON part I'm happy to help. To give a few
>>>> hints on what is very doable: for instance, we could add an Avro module to
>>>> Johnzon (the ASF JSON-P/JSON-B implementation) to back JSON-P with Avro
>>>> (I guess it will be one of the first things asked for).
>>>>
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>> <http://rmannibucau.wordpress.com> | Github
>>>> <https://github.com/rmannibucau> | LinkedIn
>>>> <https://www.linkedin.com/in/rmannibucau>
>>>>
>>>> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>
>>>>> Agree. The initial implementation will be a prototype.
>>>>>
>>>>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <
>>>>> j...@nanthrax.net> wrote:
>>>>>
>>>>>> Hi Reuven,
>>>>>>
>>>>>> Agree on being able to describe the schema with different formats. The
>>>>>> good point about json schemas is that they are described by a spec. My
>>>>>> point is also to avoid reinventing the wheel. Just an abstraction able
>>>>>> to use Avro, Json, Calcite, or custom schema descriptors would be great.
>>>>>>
>>>>>> Using a coder to describe a schema sounds like a smart move to
>>>>>> implement quickly. However, it has to be clearly documented to
>>>>>> avoid "side effects". I still think PCollection.setSchema() is better: it
>>>>>> should be metadata (or a hint ;)) on the PCollection.
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>>>>
>>>>>>> As to the question of how a schema should be specified, I want to
>>>>>>> support several common schema formats. So if a user has a Json schema, 
>>>>>>> or
>>>>>>> an Avro schema, or a Calcite schema, etc. there should be adapters that
>>>>>>> allow setting a schema from any of them. I don't think we should prefer 
>>>>>>> one
>>>>>>> over the other. While Romain is right that many people know Json, I 
>>>>>>> think
>>>>>>> far fewer people know Json schemas.
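The adapter idea above could look roughly like the following standalone sketch. All names here (SchemaAdapter, the toy adapters, and the Schema class) are hypothetical illustrations, not the actual Beam API; real adapters would walk Avro's Schema.getFields() or a parsed JSON Schema document:

```java
import java.util.*;

// Hypothetical sketch of the adapter idea: each schema format converts into
// one common in-memory representation, so no single format is preferred.
final class Schema {
    final List<String> fieldNames;
    Schema(List<String> fieldNames) { this.fieldNames = List.copyOf(fieldNames); }
}

interface SchemaAdapter<S> {
    Schema toBeamSchema(S externalSchema);
}

// Toy "Avro-like" adapter: the external schema is just a list of field
// names here; a real one would traverse the Avro Schema object.
final class AvroLikeAdapter implements SchemaAdapter<List<String>> {
    public Schema toBeamSchema(List<String> avroFieldNames) {
        return new Schema(avroFieldNames);
    }
}

// Toy "JSON-Schema-like" adapter: the external schema is a map with a
// "properties" key, as in a JSON Schema object definition.
final class JsonLikeAdapter implements SchemaAdapter<Map<String, Object>> {
    @SuppressWarnings("unchecked")
    public Schema toBeamSchema(Map<String, Object> jsonSchema) {
        Map<String, Object> props = (Map<String, Object>) jsonSchema.get("properties");
        return new Schema(new ArrayList<>(props.keySet()));
    }
}
```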
>>>>>>>
>>>>>>> Agree, schemas should not be enforced (for one thing, that wouldn't
>>>>>>> be backwards compatible!). I think for the initial prototype I will
>>>>>>> probably use a special coder to represent the schema (with setSchema an
>>>>>>> option on the coder), largely because it doesn't require modifying
>>>>>>> PCollection. However I think longer term a schema should be an optional
>>>>>>> piece of metadata on the PCollection object. Similar to the previous
>>>>>>> discussion about "hints," I think this can be set on the producing
>>>>>>> PTransform, and a SetSchema PTransform will allow attaching a schema to 
>>>>>>> any
>>>>>>> PCollection (i.e. pc.apply(SetSchema.of(schema))). This part isn't
>>>>>>> designed yet, but I think schema should be similar to hints, it's just
>>>>>>> another piece of metadata on the PCollection (though something 
>>>>>>> interpreted
>>>>>>> by the model, where hints are interpreted by the runner).
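The SetSchema shape described above can be sketched with plain-Java stand-ins. FakeCollection, SetSchema, and Schema below are all made up for illustration; the real pc.apply(SetSchema.of(schema)) would operate on a Beam PCollection:

```java
import java.util.*;

// Plain-Java stand-ins to illustrate the shape of the proposed API: a schema
// attached as optional metadata rather than baked into the element type.
final class Schema {
    final List<String> fieldNames;
    Schema(List<String> fieldNames) { this.fieldNames = List.copyOf(fieldNames); }
}

// Minimal stand-in for a PCollection: elements plus optional schema metadata.
final class FakeCollection<T> {
    final List<T> elements;
    final Schema schema;  // null means "no schema attached"
    FakeCollection(List<T> elements, Schema schema) {
        this.elements = elements;
        this.schema = schema;
    }
    // apply(...) returns a new collection, leaving the original untouched,
    // mirroring how pc.apply(SetSchema.of(schema)) would behave.
    FakeCollection<T> apply(SetSchema<T> transform) { return transform.expand(this); }
}

// The transform only attaches metadata; elements pass through unchanged.
final class SetSchema<T> {
    private final Schema schema;
    private SetSchema(Schema schema) { this.schema = schema; }
    static <T> SetSchema<T> of(Schema schema) { return new SetSchema<>(schema); }
    FakeCollection<T> expand(FakeCollection<T> input) {
        return new FakeCollection<>(input.elements, schema);
    }
}
```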
>>>>>>>
>>>>>>> Reuven
>>>>>>>
>>>>>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <
>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>
>>>>>>>     Hi,
>>>>>>>
>>>>>>>     I think we should avoid mixing two things in the discussion (and
>>>>>>>     so in the document):
>>>>>>>
>>>>>>>     1. The element of the collection and the schema itself are two
>>>>>>>     different things.
>>>>>>>     In essence, Beam should not enforce any schema. That's why I think
>>>>>>>     it's a good
>>>>>>>     idea to set the schema optionally on the PCollection
>>>>>>>     (pcollection.setSchema()).
>>>>>>>
>>>>>>>     2. From point 1 come two questions: how do we represent a schema?
>>>>>>>     How can we
>>>>>>>     leverage the schema to simplify serialization of the elements in the
>>>>>>>     PCollection and querying? These two questions are not directly
>>>>>>>     related.
>>>>>>>
>>>>>>>       2.1 How do we represent the schema
>>>>>>>     Json Schema is a very interesting idea. It could be an abstraction,
>>>>>>>     and other
>>>>>>>     providers, like Avro, could be bound to it. It's part of the json
>>>>>>>     processing spec
>>>>>>>     (javax).
>>>>>>>
>>>>>>>       2.2. How do we leverage the schema for query and serialization
>>>>>>>     Also in the spec, JSON Pointer is interesting for querying.
>>>>>>>     Regarding
>>>>>>>     serialization, Jackson or another data binder can be used.
>>>>>>>
>>>>>>>     It's still rough ideas in my mind, but I like Romain's idea about
>>>>>>>     json-p usage.
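To illustrate the JSON Pointer querying idea mentioned above, here is a tiny standalone sketch that resolves an RFC 6901-style pointer against nested maps. MiniPointer is an invented simplification, not the javax.json JsonPointer API, and it only handles object members:

```java
import java.util.*;

// Minimal sketch of RFC 6901-style pointer resolution over nested maps,
// e.g. "/user/address/city". Real JSON-P exposes this as
// javax.json.JsonPointer; this stand-in only handles object members.
final class MiniPointer {
    static Object resolve(Map<String, Object> doc, String pointer) {
        Object current = doc;
        if (pointer.isEmpty()) return current;  // "" points at the whole document
        for (String token : pointer.substring(1).split("/", -1)) {
            // Unescape per RFC 6901: "~1" -> "/" first, then "~0" -> "~".
            String key = token.replace("~1", "/").replace("~0", "~");
            if (!(current instanceof Map)) {
                throw new IllegalArgumentException("No value at: " + pointer);
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }
}
```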
>>>>>>>
>>>>>>>     Once the 2.3.0 release is out, I will start updating the document
>>>>>>>     with those ideas,
>>>>>>>     and a PoC.
>>>>>>>
>>>>>>>     Thanks !
>>>>>>>     Regards
>>>>>>>     JB
>>>>>>>
>>>>>>>     On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>>>>>     >
>>>>>>>     >
>>>>>>>     > On Jan 30, 2018 at 1:09 AM, "Reuven Lax" <re...@google.com> wrote:
>>>>>>>     >
>>>>>>>     >
>>>>>>>     >
>>>>>>>      >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>>>>>>>     <rmannibu...@gmail.com> wrote:
>>>>>>>      >
>>>>>>>      >         Hi
>>>>>>>      >
>>>>>>>      >         I have some questions on this: how would hierarchical
>>>>>>>     schemas work? It seems
>>>>>>>      >         they are not really supported by the ecosystem (outside
>>>>>>>     of custom stuff) :(.
>>>>>>>      >         How would it integrate smoothly with other generic
>>>>>>> record
>>>>>>>     types - N bridges?
>>>>>>>      >
>>>>>>>      >
>>>>>>>      >     Do you mean nested schemas? What do you mean here?
>>>>>>>      >
>>>>>>>      >
>>>>>>>      > Yes, sorry - I wrote the mail too late ;). I meant hierarchical
>>>>>>>     data and nested schemas.
>>>>>>>      >
>>>>>>>      >
>>>>>>>      >         Concretely I wonder if using a JSON API couldn't be
>>>>>>>     beneficial: json-p is a
>>>>>>>      >         nice generic abstraction with a built-in querying
>>>>>>>     mechanism (JsonPointer)
>>>>>>>      >         but no actual serialization (even if json and binary
>>>>>>>     json are very
>>>>>>>      >         natural). The big advantage is to have a well-known
>>>>>>>     ecosystem - who
>>>>>>>      >         doesn't know json today? - that Beam can reuse for free:
>>>>>>>     JsonObject
>>>>>>>      >         (I guess we don't want the JsonValue abstraction) for the
>>>>>>>     record type,
>>>>>>>      >         the JSON Schema standard for the schema, JsonPointer for
>>>>>>>     the
>>>>>>>      >         selection/projection etc... It doesn't enforce the actual
>>>>>>>     serialization
>>>>>>>      >         (json, smile, avro, ...) but provides an expressive and
>>>>>>>     already known API,
>>>>>>>      >         so I see it as a big win-win for users (no need to learn
>>>>>>>     a new API and
>>>>>>>      >         use N bridges in all directions) and Beam (the
>>>>>>>     implementations exist and
>>>>>>>      >         the API design is already thought out).
>>>>>>>      >
>>>>>>>      >
>>>>>>>      >     I assume you're talking about the API for setting schemas,
>>>>>>>     not using them.
>>>>>>>      >     Json has many downsides and I'm not sure it's true that
>>>>>>>     everyone knows it;
>>>>>>>      >     there are also competing schema APIs, such as Avro, etc.
>>>>>>>     However I think we
>>>>>>>      >     should give Json a fair evaluation before dismissing it.
>>>>>>>      >
>>>>>>>      >
>>>>>>>      > It is a wider topic than schemas. Actually schemas are not the
>>>>>>>     first-class citizen here; a
>>>>>>>      > generic data representation is. That is where json beats almost
>>>>>>>     any other API.
>>>>>>>      > Then, when it comes to schemas, json has a standard for that, so
>>>>>>>     we are all good.
>>>>>>>      >
>>>>>>>      > Also json has a good indexing API compared to alternatives,
>>>>>>>     which are sometimes a
>>>>>>>      > bit faster - for no-op transforms - but are hardly usable or
>>>>>>>     make the code not
>>>>>>>      > that readable.
>>>>>>>      >
>>>>>>>      > Avro is a nice competitor, and it is compatible - actually Avro
>>>>>>>     is json-driven by
>>>>>>>      > design - but its API is far from being that easy due to its
>>>>>>>     schema enforcement, which
>>>>>>>      > is heavy, and worse, you can't work with Avro without a
>>>>>>>     schema. Json would
>>>>>>>      > allow reconciling the dynamic and static cases, since the job
>>>>>>>     wouldn't change
>>>>>>>      > except for the setSchema.
>>>>>>>      >
>>>>>>>      > That is why I think json is a good compromise, and having a
>>>>>>>     standard API for it
>>>>>>>      > allows fully customizing the impl at will if needed - even
>>>>>>>     using avro or protobuf.
>>>>>>>      >
>>>>>>>      > Side note on the Beam API: I don't think it is good to use the
>>>>>>>     main API for runner
>>>>>>>      > optimization. It forces something to be shared by all runners
>>>>>>>     but not widely
>>>>>>>      > usable. It is also misleading for users. Would you set a Flink
>>>>>>>     pipeline option
>>>>>>>      > with Dataflow? My proposal here is to use hints - properties -
>>>>>>>     instead of
>>>>>>>      > something hard-coded in the API, then standardize it if all
>>>>>>>     runners support it.
>>>>>>>      >
>>>>>>>      >
>>>>>>>      >
>>>>>>>      >         Wdyt?
>>>>>>>      >
>>>>>>>      >         On Jan 29, 2018 at 6:24 AM, "Jean-Baptiste Onofré"
>>>>>>>     <j...@nanthrax.net> wrote:
>>>>>>>
>>>>>>>      >
>>>>>>>      >             Hi Reuven,
>>>>>>>      >
>>>>>>>      >             Thanks for the update ! As I'm working with you on
>>>>>>>     this, I fully
>>>>>>>      >             agree and great
>>>>>>>      >             doc gathering the ideas.
>>>>>>>      >
>>>>>>>      >             It's clearly something we have to add to Beam asap,
>>>>>>>     because it would
>>>>>>>      >             allow new
>>>>>>>      >             use cases for our users (in a simple way) and open
>>>>>>>     new areas for the
>>>>>>>      >             runners
>>>>>>>      >             (for instance dataframe support in the Spark
>>>>>>> runner).
>>>>>>>      >
>>>>>>>      >             By the way, a while ago, I created BEAM-3437 to
>>>>>>> track
>>>>>>>     the PoC/PR
>>>>>>>      >             around this.
>>>>>>>      >
>>>>>>>      >             Thanks !
>>>>>>>      >
>>>>>>>      >             Regards
>>>>>>>      >             JB
>>>>>>>      >
>>>>>>>      >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>>>>      >             > Previously I submitted a proposal for adding
>>>>>>>     schemas as a
>>>>>>>      >             first-class concept on
>>>>>>>      >             > Beam PCollections. The proposal engendered
>>>>>>> quite a
>>>>>>>     bit of
>>>>>>>      >             discussion from the
>>>>>>>      >             > community - more discussion than I've seen from
>>>>>>>     almost any of our
>>>>>>>      >             proposals to
>>>>>>>      >             > date!
>>>>>>>      >             >
>>>>>>>      >             > Based on the feedback and comments, I reworked
>>>>>>> the
>>>>>>>     proposal
>>>>>>>      >             document quite a
>>>>>>>      >             > bit. It now talks more explicitly about the
>>>>>>>     difference between
>>>>>>>      >             dynamic schemas
>>>>>>>      >             > (where the schema is not fully known at
>>>>>>>     graph-creation time),
>>>>>>>      >             and static
>>>>>>>      >             > schemas (which are fully known at graph-creation
>>>>>>>     time). Proposed
>>>>>>>      >             APIs are more
>>>>>>>      >             > fleshed out now (again thanks to feedback from
>>>>>>>     community members),
>>>>>>>      >             and the
>>>>>>>      >             > document talks in more detail about evolving
>>>>>>> schemas in
>>>>>>>      >             long-running streaming
>>>>>>>      >             > pipelines.
>>>>>>>      >             >
>>>>>>>      >             > Please take a look. I think this will be very
>>>>>>>     valuable to Beam,
>>>>>>>      >             and welcome any
>>>>>>>      >             > feedback.
>>>>>>>      >             >
>>>>>>>      >             >
>>>>>>>      >
>>>>>>>     https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>>>>      >             >
>>>>>>>      >             > Reuven
>>>>>>>      >
>>>>>>>      >             --
>>>>>>>      >             Jean-Baptiste Onofré
>>>>>>>      > jbono...@apache.org
>>>>>>>      > http://blog.nanthrax.net
>>>>>>>      >             Talend - http://www.talend.com
>>>>>>>      >
>>>>>>>      >
>>>>>>>      >
>>>>>>>
>>>>>>>     --
>>>>>>>     Jean-Baptiste Onofré
>>>>>>>     jbono...@apache.org
>>>>>>>     http://blog.nanthrax.net
>>>>>>>     Talend - http://www.talend.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
