Cool, can I work with you on this (sharing a branch, for instance)?
Thanks!

Regards
JB

On 03/05/2018 01:01 PM, Reuven Lax wrote:
> Yes, I do have a PoC in progress. The Beam Row class was being refactored, so I
> paused to wait for that to finish.
>
> On Sun, Mar 4, 2018 at 8:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
> Hi Reuven,
>
> I am reviving this discussion as I think it would be a great addition.
>
> We had some discussion on the fly, but I think now, as a base for discussion, it
> would be great to have a feature branch where we can start some sketch/impl and
> discuss.
>
> @Reuven, did you start a PoC with what you proposed:
> - SchemaCoder
> - SchemaRegistry
> - @FieldAccess on DoFn
> - Select.fields PTransform
> ?
>
> If not, I volunteer to start the branch and start to sketch.
>
> Thoughts?
>
> Regards
> JB
>
> On 02/04/2018 08:23 PM, Reuven Lax wrote:
> > Cool, let's chat about this on slack for a bit (which I realized I've been
> > signed out of for some time).
> >
> > Reuven
> >
> > On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >
> > Sorry guys, I was off today. Happy to be part of the party too ;)
> >
> > Regards
> > JB
> >
> > On 02/04/2018 06:19 PM, Reuven Lax wrote:
> > > Romain, since you're interested maybe the two of us should put together a
> > > proposal for how to set these things (hints, schema) on PCollections? I don't
> > > think it'll be hard - the previous list thread on hints already agreed on a
> > > general approach, and we would just need to flesh it out.
> > >
> > > BTW in the past when I looked, Json schemas seemed to have some odd limitations
> > > inherited from Javascript (e.g. no distinction between integer and
> > > floating-point types). Is that still true?
> > >
> > > Reuven
> > >
> > > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > >
> > > 2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com>:
> > >
> > > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > >
> > > 2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com>:
> > >
> > > I'm not sure where proto comes from here. Proto is one example
> > > of a type that has a schema, but only one example.
> > >
> > > 1. In the initial prototype I want to avoid modifying the
> > > PCollection API. So I think it's best to create a special
> > > SchemaCoder, and pass the schema into this coder. Later we might
> > > add targeted APIs for this instead of going through a coder.
> > > 1.a I don't see what hints have to do with this?
> > >
> > > Hints are a way to replace the new API and unify the way to pass
> > > metadata in beam instead of adding a new custom way each time.
> > >
> > > I don't think schema is a hint.
But I hear what your > saying > - hint > > is a > > > type of PCollection metadata as is schema, and we should > have a > > unified > > > API for setting such metadata. > > > > > > > > > :), Ismael pointed me out earlier this week that "hint" had an > old meaning > > > in beam. My usage is purely the one done in most EE spec (your > > "metadata" in > > > previous answer). But guess we are aligned on the meaning now, > just wanted > > > to be sure. > > > > > > > > > > > > > > > > > > > > > > > > 2. BeamSQL already has a generic record type > which fits > > this use > > > case very well (though we might modify it). > However as > > mentioned > > > in the doc, the user is never forced to use this > generic > > record > > > type. > > > > > > > > > Well yes and not. A type already exists but 1. it is > very strictly > > > limited (flat/columns only which is very few of what > big > data SQL > > > can do) and 2. it must be aligned on the converge of > generic data > > > the schema will bring (really read "aligned" as > "dropped > in favor > > > of" - deprecated being a smooth way to do it). > > > > > > > > > As I said the existing class needs to be modified and > extended, > > and not > > > just for this schema us was. It was meant to represent > Calcite SQL > > rows, > > > but doesn't quite even do that yet (Calcite supports > nested > rows). > > > However I think it's the right basis to start from. > > > > > > > > > Agree on the state. Current impl issues I hit (additionally to > the nested > > > support which would require by itself a kind of visitor > solution) are the > > > fact to own the schema in the record and handle field by > field the > > > serialization instead of as a whole which is how it would be > handled > > with a > > > schema IMHO. > > > > > > Concretely what I don't want is to do a PoC which works - they > all work > > > right? 
> > > and integrate it into beam without thinking about a global solution for this
> > > generic record issue and its schema standardization. This is where Json(-P)
> > > has a lot of value IMHO but requires a bit more love than just adding schema
> > > in the model.
> > >
> > > So long story short the main work of this schema track is not only
> > > on using schema in runners and other ways but also starting to make
> > > beam consistent with itself which is probably the most important
> > > outcome since it is the user facing side of this work.
> > >
> > > On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > >
> > > @Reuven: is the proto only about passing schema or also the
> > > generic type?
> > >
> > > There are 2.5 topics to solve this issue:
> > >
> > > 1. How to pass schema
> > > 1.a. hints?
> > > 2. What is the generic record type associated to a schema
> > > and how to express a schema relatively to it
> > >
> > > I would be happy to help on 1.a and 2 somehow if you need.
> > >
> > > On Feb 4, 2018 at 03:30, "Reuven Lax" <re...@google.com> wrote:
> > >
> > > One more thing. If anyone here has experience with
> > > various OSS metadata stores (e.g. Kafka Schema Registry
> > > is one example), would you like to collaborate on
> > > implementation? I want to make sure that source schemas
> > > can be stored in a variety of OSS metadata stores, and
> > > be easily pulled into a Beam pipeline.
> > >
> > > Reuven
> > >
> > > On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
> > >
> > > Hi all,
> > >
> > > If there are no concerns, I would like to start
> > > working on a prototype. It's just a prototype, so I
> > > don't think it will have the final API (e.g. for the
> > > prototype I'm going to avoid changing the API of
> > > PCollection, and use a "special" Coder instead).
> > > Also even once we go beyond prototype, it will be
> > > @Experimental for some time, so the API will not be
> > > fixed in stone.
> > >
> > > Any more comments on this approach before we start
> > > implementing a prototype?
> > >
> > > Reuven
> > >
> > > On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > >
> > > If you need help on the json part I'm happy to
> > > help. To give a few hints on what is very
> > > doable: we can add an avro module to johnzon
> > > (asf json{p,b} impl) to back jsonp by avro
> > > (guess it will be one of the first to be asked)
> > > for instance.
> > >
> > > Romain Manni-Bucau
> > > @rmannibucau <https://twitter.com/rmannibucau> |
> > > Blog <https://rmannibucau.metawerx.net/> |
> > > Old Blog <http://rmannibucau.wordpress.com> |
> > > Github <https://github.com/rmannibucau> |
> > > LinkedIn <https://www.linkedin.com/in/rmannibucau>
> > >
> > > 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
> > >
> > > Agree. The initial implementation will be a prototype.
> > >
> > > On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >
> > > Hi Reuven,
> > >
> > > Agree to be able to describe the schema with different formats. The
> > > good point about json schemas is that they are described by a spec. My
> > > point is also to avoid reinventing the wheel. Just an abstraction to be
> > > able to use Avro, Json, Calcite, custom schema descriptors would be great.
> > >
> > > Using a coder to describe a schema sounds like a smart move to implement
> > > quickly. However, it has to be clear in terms of documentation to avoid
> > > "side effects". I still think PCollection.setSchema() is better: it
> > > should be metadata (or hint ;))) on the PCollection.
> > >
> > > Regards
> > > JB
> > >
> > > On 31/01/2018 20:16, Reuven Lax wrote:
> > >
> > > As to the question of how a schema should be specified, I want to
> > > support several common schema formats. So if a user has a Json
> > > schema, or an Avro schema, or a Calcite schema, etc. there should be
> > > adapters that allow setting a schema from any of them. I don't think we
> > > should prefer one over the other. While Romain is right that many
> > > people know Json, I think far fewer people know Json schemas.
> > >
> > > Agree, schemas should not be enforced (for one thing, that
> > > wouldn't be backwards compatible!). I think for the initial prototype I
> > > will probably use a special coder to represent the schema (with setSchema
> > > an option on the coder), largely because it doesn't require modifying
> > > PCollection. However I think longer term a schema should be an optional
> > > piece of metadata on the PCollection object. Similar to the previous
> > > discussion about "hints," I think this can be set on the producing
> > > PTransform, and a SetSchema PTransform will allow attaching a
> > > schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))).
> > > This part isn't > designed > yet, > > but I > > > think schema should be > similar to > > > hints, it's just > another > piece of > > > metadata on the > PCollection > > (though > > > something interpreted > by the > > model, > > > where hints are > interpreted by the > > > runner) > > > > > > Reuven > > > > > > On Tue, Jan 30, 2018 > at > 1:37 AM, > > > Jean-Baptiste Onofré > > > <j...@nanthrax.net > <mailto:j...@nanthrax.net> > > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>> > > > > <mailto:j...@nanthrax.net > <mailto:j...@nanthrax.net> > > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>> > > > > <mailto:j...@nanthrax.net > <mailto:j...@nanthrax.net> > > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>> > > > > <mailto:j...@nanthrax.net > <mailto:j...@nanthrax.net> > > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>>>> wrote: > > > > > > Hi, > > > > > > I think we should > avoid to mix > > > two things in the > discussion > > (and so > > > the document): > > > > > > 1. The element of > the > > collection > > > and the schema itself > are two > > > different things. > > > By essence, Beam > should not > > > enforce any schema. > That's why > > I think > > > it's a good > > > idea to set the > schema > > > optionally on the > PCollection > > > > (pcollection.setSchema()). > > > > > > 2. From point 1 > comes two > > > questions: how do we > represent a > > > schema ? > > > How can we > > > leverage the > schema to > > simplify > > > the serialization of > the > > element in the > > > PCollection and > query ? These > > > two questions are not > directly > > related. > > > > > > 2.1 How do we > represent > > the schema > > > Json Schema is a > very > > > interesting idea. It > could be an > > > abstract and > > > other > > > providers, like > Avro, can be > > > bind on it. It's part > of > the json > > > processing spec > > > (javax). > > > > > > 2.2. 
How do we > leverage the > > > schema for query and > serialization > > > Also in the spec, > json pointer > > > is interesting for the > querying. > > > Regarding the > > > serialization, > jackson or > > other > > > data binder can be > used. > > > > > > It's still rough > ideas in my > > > mind, but I like > Romain's idea > > about > > > json-p usage. > > > > > > Once 2.3.0 release > is out, I > > > will start to update > the > > document with > > > those ideas, > > > and PoC. > > > > > > Thanks ! > > > Regards > > > JB > > > > > > On 01/30/2018 > 08:42 > AM, Romain > > > Manni-Bucau wrote: > > > > > > > > > > > > Le 30 janv. 2018 > 01:09, > > > "Reuven Lax" > <re...@google.com <mailto:re...@google.com> > > <mailto:re...@google.com <mailto:re...@google.com>> > > > > <mailto:re...@google.com > <mailto:re...@google.com> > > <mailto:re...@google.com <mailto:re...@google.com>>> > > > > <mailto:re...@google.com > <mailto:re...@google.com> > > <mailto:re...@google.com <mailto:re...@google.com>> > > > > <mailto:re...@google.com > <mailto:re...@google.com> > > <mailto:re...@google.com <mailto:re...@google.com>>>> > > > > > > <mailto:re...@google.com <mailto:re...@google.com> > <mailto:re...@google.com <mailto:re...@google.com>> > > > > <mailto:re...@google.com > <mailto:re...@google.com> > > <mailto:re...@google.com <mailto:re...@google.com>>> > > > > <mailto:re...@google.com > <mailto:re...@google.com> > > <mailto:re...@google.com <mailto:re...@google.com>> > > > > <mailto:re...@google.com > <mailto:re...@google.com> > > <mailto:re...@google.com <mailto:re...@google.com>>>>>> a écrit : > > > > > > > > > > > > > > > > On Mon, Jan > 29, 2018 at > > > 12:17 PM, Romain > Manni-Bucau > > > <rmannibu...@gmail.com > <mailto:rmannibu...@gmail.com> > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> > > > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com> > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>>> > > > > 
<mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com> > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> > > > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com> > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>>>> > > > > > > > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com> > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> > > > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com> > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>>> > > > > > > > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com> > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> > > > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com> > > <mailto:rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>>>>>> > wrote: > > > > > > > > Hi > > > > > > > > I have > some > > questions > > > on this: how > hierarchic > schemas > > > would work? Seems > > > > it is > not > really > > > supported by the > ecosystem (out of > > > custom stuff) :(. > > > > How > would it > > > integrate smoothly > with > other > > > generic record > > > types - N bridges? > > > > > > > > > > > > Do you mean > nested > > > schemas? What do you > mean here? > > > > > > > > > > > > Yes, sorry - > wrote the mail > > > too late ;). Was > hierarchic > > data and > > > nested schemas. > > > > > > > > > > > > > Concretely I wonder > > > if using json API > couldnt be > > > beneficial: > json-p is a > > > > nice > generic > > > abstraction with a > built in > > querying > > > mecanism > (jsonpointer) > > > > but no > actual > > > serialization (even if > json and > > > binary json > > > are very > > > > > natural). > The big > > > advantage is to have a > well known > > > ecosystem - who > > > > doesnt > know json > > > today? 
> > > > - that beam can reuse for free: JsonObject
> > > > (guess we don't want the JsonValue abstraction) for the record type,
> > > > the jsonschema standard for the schema, jsonpointer for the
> > > > selection/projection etc... It doesn't enforce the actual serialization
> > > > (json, smile, avro, ...) but provides an expressive and already known API
> > > > so I see it as a big win-win for users (no need to learn a new API and
> > > > use N bridges in all ways) and beam (impls are here and API design
> > > > already thought out).
> > > >
> > > > I assume you're talking about the API for setting schemas, not using them.
> > > > Json has many downsides and I'm not sure it's true that everyone knows it;
> > > > there are also competing schema APIs, such as Avro etc. However I think we
> > > > should give Json a fair evaluation before dismissing it.
> > > >
> > > > It is a wider topic than schema. Actually schemas are not the first-class
> > > > citizen but a generic data representation is. That is where json beats
> > > > almost any other API. Then, when it comes to schema, json has a standard
> > > > for that so we are all good.
> > > >
> > > > Also json has a good indexing API compared to alternatives which are
> > > > sometimes a bit faster - for noop transforms - but are hardly usable or
> > > > make the code not that readable.
> > > >
> > > > Avro is a nice competitor but it is compatible - actually avro is json
> > > > driven by design - but its API is far from being that easy due to its
> > > > schema enforcement which is heavvvyyy, and worse, you can't work with avro
> > > > without a schema.
> > > > Json would allow us to reconcile the dynamic and static cases since the
> > > > job wouldn't change except for the setSchema.
> > > >
> > > > That is why I think json is a good compromise, and having a standard API
> > > > for it allows us to fully customize the impl at will if needed - even
> > > > using avro or protobuf.
> > > >
> > > > Side note on beam api: I don't think it is good to use a main API for
> > > > runner optimization. It enforces something to be shared on all runners but
> > > > not widely usable. It is also misleading for users. Would you set a flink
> > > > pipeline option with dataflow? My proposal here is to use hints -
> > > > properties - instead of something hardly defined in the API, then
> > > > standardize it if all runners support it.
> > > >
> > > > Wdyt?
> > > >
> > > > On Jan 29, 2018 at 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
> > > >
> > > > > Hi Reuven,
> > > > >
> > > > > Thanks for the update! As I'm working with you on this, I fully
> > > > > agree, and great doc gathering the ideas.
> > > > >
> > > > > It's clearly something we have to add asap in Beam, because it would
> > > > > allow new use cases for our users (in a simple way) and open new areas
> > > > > for the runners (for instance dataframe support in the Spark runner).
> > > > >
> > > > > By the way, a while ago, I created BEAM-3437 to

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com