Hi all,

If there are no concerns, I would like to start working on a prototype. It's
just a prototype, so I don't think it will have the final API (e.g. for the
prototype I'm going to avoid changing the API of PCollection, and use a
"special" Coder instead). Also, even once we go beyond the prototype, it will
be @Experimental for some time, so the API will not be set in stone.
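To make the "special Coder" idea concrete, here is a minimal, self-contained sketch. The names `Schema` and `SchemaCoder` are hypothetical, not actual Beam API: the point is only that a coder can carry a schema object the model can inspect, while element encoding is delegated as usual, so PCollection itself stays untouched.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not real Beam API: a coder that carries a schema
// alongside the encoding it wraps, so a schema can ride on a PCollection
// without changing the PCollection API itself.
public class SchemaCoderSketch {

    /** A deliberately tiny schema: ordered field name -> type name. */
    public static final class Schema {
        final Map<String, String> fields = new LinkedHashMap<>();

        Schema field(String name, String type) {
            fields.put(name, type);
            return this;
        }
    }

    /** Coder-like wrapper: bytes are delegated, the schema rides along. */
    public static final class SchemaCoder {
        private final Schema schema;

        SchemaCoder(Schema schema) {
            this.schema = schema;
        }

        /** The model (not the runner) can ask the coder for its schema. */
        Schema getSchema() {
            return schema;
        }

        /** Stand-in for real encoding; a real coder would delegate here. */
        byte[] encode(String element) {
            return element.getBytes(StandardCharsets.UTF_8);
        }

        String decode(byte[] bytes) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        Schema schema = new Schema().field("userId", "INT64").field("country", "STRING");
        SchemaCoder coder = new SchemaCoder(schema);
        // Round-trip still works, and the schema is discoverable.
        if (!coder.decode(coder.encode("hello")).equals("hello")) {
            throw new AssertionError();
        }
        System.out.println(coder.getSchema().fields.keySet());
    }
}
```

The design trade-off the thread discusses is visible here: nothing about the element's byte encoding changes, but anything holding the coder can now discover the schema.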
Any more comments on this approach before we start implementing a prototype?

Reuven

On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> If you need help on the json part, I'm happy to help. To give a few hints
> on what is very doable: we can add an avro module to johnzon (the ASF
> json{p,b} impl) to back jsonp with avro (I guess it will be one of the
> first to be asked), for instance.
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>
>
> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>
>> Agree. The initial implementation will be a prototype.
>>
>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Hi Reuven,
>>>
>>> Agreed on being able to describe the schema with different formats. The
>>> good point about json schemas is that they are described by a spec. My
>>> point is also to avoid reinventing the wheel. Just an abstraction able
>>> to use Avro, Json, Calcite, or custom schema descriptors would be great.
>>>
>>> Using a coder to describe a schema sounds like a smart move to implement
>>> quickly. However, it has to be clear in terms of documentation to avoid
>>> "side effects". I still think PCollection.setSchema() is better: it
>>> should be metadata (or a hint ;))) on the PCollection.
>>>
>>> Regards
>>> JB
>>>
>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>
>>>> As to the question of how a schema should be specified, I want to
>>>> support several common schema formats. So if a user has a Json schema,
>>>> or an Avro schema, or a Calcite schema, etc., there should be adapters
>>>> that allow setting a schema from any of them. I don't think we should
>>>> prefer one over the other. While Romain is right that many people know
>>>> Json, I think far fewer people know Json schemas.
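The adapter idea above can be sketched as one internal, format-neutral schema type with small converters from each external format. Everything below is illustrative: the names `Schema`, `fromJsonSchemaProperties`, and `fromAvroFields` are made up for this sketch and are not Beam, JSON Schema, or Avro API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical adapter sketch: one internal Schema representation, built
// from several (heavily simplified) external schema formats.
public class SchemaAdapters {

    /** The internal, format-neutral schema: field name -> type name. */
    public static final class Schema {
        final Map<String, String> fields = new LinkedHashMap<>();
    }

    /** Adapter from a simplified JSON-Schema-style "properties" map. */
    public static Schema fromJsonSchemaProperties(Map<String, String> properties) {
        Schema s = new Schema();
        // Normalize JSON Schema type names to the internal vocabulary.
        properties.forEach((name, jsonType) ->
            s.fields.put(name, jsonType.equals("integer") ? "INT64" : "STRING"));
        return s;
    }

    /** Adapter from a simplified Avro-style "name:type" field list. */
    public static Schema fromAvroFields(String... fields) {
        Schema s = new Schema();
        for (String f : fields) {
            String[] parts = f.split(":");
            // Normalize Avro type names to the same internal vocabulary.
            s.fields.put(parts[0], parts[1].equals("long") ? "INT64" : "STRING");
        }
        return s;
    }

    public static void main(String[] args) {
        Schema fromJson = fromJsonSchemaProperties(Map.of("id", "integer"));
        Schema fromAvro = fromAvroFields("id:long");
        // Both external formats normalize to the same internal schema,
        // which is what lets the rest of the model stay format-agnostic.
        if (!fromJson.fields.equals(fromAvro.fields)) {
            throw new AssertionError();
        }
        System.out.println(fromJson.fields);
    }
}
```

The key property is that everything downstream of the adapters sees only the internal representation, so no single external format is privileged.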
>>>>
>>>> Agree, schemas should not be enforced (for one thing, that wouldn't be
>>>> backwards compatible!). I think for the initial prototype I will
>>>> probably use a special coder to represent the schema (with setSchema an
>>>> option on the coder), largely because it doesn't require modifying
>>>> PCollection. However, I think longer term a schema should be an
>>>> optional piece of metadata on the PCollection object. Similar to the
>>>> previous discussion about "hints," I think this can be set on the
>>>> producing PTransform, and a SetSchema PTransform will allow attaching a
>>>> schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))). This
>>>> part isn't designed yet, but I think schemas should be similar to
>>>> hints: just another piece of metadata on the PCollection (though
>>>> interpreted by the model, whereas hints are interpreted by the runner).
>>>>
>>>> Reuven
>>>>
>>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I think we should avoid mixing two things in the discussion (and so
>>>> the document):
>>>>
>>>> 1. The element of the collection and the schema itself are two
>>>> different things. In essence, Beam should not enforce any schema.
>>>> That's why I think it's a good idea to set the schema optionally on the
>>>> PCollection (pcollection.setSchema()).
>>>>
>>>> 2. From point 1 come two questions: how do we represent a schema? How
>>>> can we leverage the schema to simplify the serialization of the
>>>> elements in the PCollection, and querying? These two questions are not
>>>> directly related.
>>>>
>>>> 2.1. How do we represent the schema?
>>>> Json Schema is a very interesting idea. It could be an abstraction, and
>>>> other providers, like Avro, could be bound to it. It's part of the json
>>>> processing spec (javax).
>>>>
>>>> 2.2.
How do we leverage the schema for query and serialization?
>>>> Also in the spec, JSON Pointer is interesting for querying. Regarding
>>>> the serialization, Jackson or another data binder can be used.
>>>>
>>>> These are still rough ideas in my mind, but I like Romain's idea about
>>>> json-p usage.
>>>>
>>>> Once the 2.3.0 release is out, I will start to update the document with
>>>> those ideas, and a PoC.
>>>>
>>>> Thanks!
>>>> Regards
>>>> JB
>>>>
>>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>> >
>>>> > On Jan 30, 2018 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>> >
>>>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>> >
>>>> > Hi
>>>> >
>>>> > I have some questions on this: how would hierarchical schemas
>>>> > work? It seems they are not really supported by the ecosystem
>>>> > (outside of custom stuff) :(. How would it integrate smoothly
>>>> > with other generic record types - N bridges?
>>>> >
>>>> > Do you mean nested schemas? What do you mean here?
>>>> >
>>>> > Yes, sorry - wrote the mail too late ;). I meant hierarchical data and
>>>> > nested schemas.
>>>> >
>>>> > Concretely, I wonder if using a json API couldn't be beneficial:
>>>> > json-p is a nice generic abstraction with a built-in querying
>>>> > mechanism (jsonpointer) but no actual serialization (even if json and
>>>> > binary json are very natural). The big advantage is to have a
>>>> > well-known ecosystem - who doesn't know json today? - that beam can
>>>> > reuse for free: JsonObject (I guess we don't want the JsonValue
>>>> > abstraction) for the record type, the jsonschema standard for the
>>>> > schema, jsonpointer for the selection/projection, etc. It doesn't
>>>> > enforce the actual
It doesnt enforce the actual >>>> serialization >>>> > (json, smile, avro, ...) but provide an expressive and >>>> alread known API >>>> > so i see it as a big win-win for users (no need to learn >>>> a new API and >>>> > use N bridges in all ways) and beam (impls are here and >>>> API design >>>> > already thought). >>>> > >>>> > >>>> > I assume you're talking about the API for setting schemas, >>>> not using them. >>>> > Json has many downsides and I'm not sure it's true that >>>> everyone knows it; >>>> > there are also competing schema APIs, such as Avro etc.. >>>> However I think we >>>> > should give Json a fair evaluation before dismissing it. >>>> > >>>> > >>>> > It is a wider topic than schema. Actually schema are not the >>>> first citizen but a >>>> > generic data representation is. That is where json hits almost >>>> any other API. >>>> > Then, when it comes to schema, json has a standard for that so we >>>> are all good. >>>> > >>>> > Also json has a good indexing API compared to alternatives which >>>> are sometimes a >>>> > bit faster - for noop transforms - but are hardly usable or make >>>> the code not >>>> > that readable. >>>> > >>>> > Avro is a nice competitor but it is compatible - actually avro is >>>> json driven by >>>> > design - but its API is far to be that easy due to its schema >>>> enforcement which >>>> > is heavvvyyy and worse is you cant work with avro without a >>>> schema. Json would >>>> > allow to reconciliate the dynamic and static cases since the job >>>> wouldnt change >>>> > except the setschema. >>>> > >>>> > That is why I think json is a good compromise and having a >>>> standard API for it >>>> > allow to fully customize the imol as will if needed - even using >>>> avro or protobuf. >>>> > >>>> > Side note on beam api: i dont think it is good to use a main API >>>> for runner >>>> > optimization. It enforces something to be shared on all runners >>>> but not widely >>>> > usable. It is also misleading for users. 
Would you set a flink >>>> pipeline option >>>> > with dataflow? My proposal here is to use hints - properties - >>>> instead of >>>> > something hardly defined in the API then standardize it if all >>>> runners support it. >>>> > >>>> > >>>> > >>>> > Wdyt? >>>> > >>>> > Le 29 janv. 2018 06:24, "Jean-Baptiste Onofré" >>>> <j...@nanthrax.net <mailto:j...@nanthrax.net> >>>> > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>> a >>>> écrit : >>>> >>>> > >>>> > Hi Reuven, >>>> > >>>> > Thanks for the update ! As I'm working with you on >>>> this, I fully >>>> > agree and great >>>> > doc gathering the ideas. >>>> > >>>> > It's clearly something we have to add asap in Beam, >>>> because it would >>>> > allow new >>>> > use cases for our users (in a simple way) and open >>>> new areas for the >>>> > runners >>>> > (for instance dataframe support in the Spark runner). >>>> > >>>> > By the way, while ago, I created BEAM-3437 to track >>>> the PoC/PR >>>> > around this. >>>> > >>>> > Thanks ! >>>> > >>>> > Regards >>>> > JB >>>> > >>>> > On 01/29/2018 02:08 AM, Reuven Lax wrote: >>>> > > Previously I submitted a proposal for adding >>>> schemas as a >>>> > first-class concept on >>>> > > Beam PCollections. The proposal engendered quite a >>>> bit of >>>> > discussion from the >>>> > > community - more discussion than I've seen from >>>> almost any of our >>>> > proposals to >>>> > > date! >>>> > > >>>> > > Based on the feedback and comments, I reworked the >>>> proposal >>>> > document quite a >>>> > > bit. It now talks more explicitly about the >>>> different between >>>> > dynamic schemas >>>> > > (where the schema is not fully not know at >>>> graph-creation time), >>>> > and static >>>> > > schemas (which are fully know at graph-creation >>>> time). 
Proposed >>>> > APIs are more >>>> > > fleshed out now (again thanks to feedback from >>>> community members), >>>> > and the >>>> > > document talks in more detail about evolving >>>> schemas in >>>> > long-running streaming >>>> > > pipelines. >>>> > > >>>> > > Please take a look. I think this will be very >>>> valuable to Beam, >>>> > and welcome any >>>> > > feedback. >>>> > > >>>> > > >>>> > >>>> https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ >>>> 12pHGK0QIvXS1FOTgRc/edit# >>>> <https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUm >>>> Q12pHGK0QIvXS1FOTgRc/edit#> >>>> > <https://docs.google.com/docu >>>> ment/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit# < >>>> https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUm >>>> Q12pHGK0QIvXS1FOTgRc/edit#>> >>>> > > >>>> > > Reuven >>>> > >>>> > -- >>>> > Jean-Baptiste Onofré >>>> > jbono...@apache.org <mailto:jbono...@apache.org> >>>> <mailto:jbono...@apache.org <mailto:jbono...@apache.org>> >>>> > http://blog.nanthrax.net >>>> > Talend - http://www.talend.com >>>> > >>>> > >>>> > >>>> >>>> -- >>>> Jean-Baptiste Onofré >>>> jbono...@apache.org <mailto:jbono...@apache.org> >>>> http://blog.nanthrax.net >>>> Talend - http://www.talend.com >>>> >>>> >>>> >> >