@Reuven: is the prototype only about passing the schema, or also about the generic type? There are 2.5 topics to solve for this issue:
1. How to pass the schema
   1.a. Hints?
2. What is the generic record type associated with a schema, and how to
   express a schema relative to it

I would be happy to help on 1.a and 2 somehow if you need.

On Feb 4, 2018 at 03:30, "Reuven Lax" <re...@google.com> wrote:

> One more thing. If anyone here has experience with various OSS metadata
> stores (Kafka Schema Registry is one example), would you like to
> collaborate on implementation? I want to make sure that source schemas can
> be stored in a variety of OSS metadata stores, and be easily pulled into a
> Beam pipeline.
>
> Reuven
>
> On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
>
>> Hi all,
>>
>> If there are no concerns, I would like to start working on a prototype.
>> It's just a prototype, so I don't think it will have the final API (e.g.
>> for the prototype I'm going to avoid changing the API of PCollection, and
>> use a "special" Coder instead). Also, even once we go beyond the prototype,
>> it will be @Experimental for some time, so the API will not be set in stone.
>>
>> Any more comments on this approach before we start implementing a
>> prototype?
>>
>> Reuven
>>
>> On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau
>> <rmannibu...@gmail.com> wrote:
>>
>>> If you need help on the json part I'm happy to help. To give a few hints
>>> on what is very doable: we can add an avro module to johnzon (asf
>>> json{p,b} impl) to back jsonp by avro (guess it will be one of the first
>>> to be asked), for instance.
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau>
>>>
>>> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>>
>>>> Agree. The initial implementation will be a prototype.
>>>>
>>>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré
>>>> <j...@nanthrax.net> wrote:
>>>>
>>>>> Hi Reuven,
>>>>>
>>>>> Agree to be able to describe the schema with different formats. The
>>>>> good point about json schemas is that they are described by a spec. My
>>>>> point is also to avoid reinventing the wheel. Just an abstraction to be
>>>>> able to use Avro, Json, Calcite, or custom schema descriptors would be
>>>>> great.
>>>>>
>>>>> Using a coder to describe a schema sounds like a smart move to implement
>>>>> quickly. However, it has to be clear in terms of documentation to avoid
>>>>> "side effects". I still think PCollection.setSchema() is better: it
>>>>> should be metadata (or hint ;))) on the PCollection.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>>>
>>>>>> As to the question of how a schema should be specified, I want to
>>>>>> support several common schema formats. So if a user has a Json schema,
>>>>>> or an Avro schema, or a Calcite schema, etc., there should be adapters
>>>>>> that allow setting a schema from any of them. I don't think we should
>>>>>> prefer one over the other. While Romain is right that many people know
>>>>>> Json, I think far fewer people know Json schemas.
>>>>>>
>>>>>> Agree, schemas should not be enforced (for one thing, that wouldn't
>>>>>> be backwards compatible!). I think for the initial prototype I will
>>>>>> probably use a special coder to represent the schema (with setSchema an
>>>>>> option on the coder), largely because it doesn't require modifying
>>>>>> PCollection. However I think longer term a schema should be an optional
>>>>>> piece of metadata on the PCollection object. Similar to the previous
>>>>>> discussion about "hints," I think this can be set on the producing
>>>>>> PTransform, and a SetSchema PTransform will allow attaching a schema to
>>>>>> any PCollection (i.e. pc.apply(SetSchema.of(schema))).
>>>>>> This part isn't designed yet, but I think schema should be similar to
>>>>>> hints: it's just another piece of metadata on the PCollection (though
>>>>>> something interpreted by the model, whereas hints are interpreted by
>>>>>> the runner).
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré
>>>>>> <j...@nanthrax.net> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think we should avoid mixing two things in the discussion (and so in
>>>>>> the document):
>>>>>>
>>>>>> 1. The element of the collection and the schema itself are two
>>>>>> different things. By essence, Beam should not enforce any schema.
>>>>>> That's why I think it's a good idea to set the schema optionally on the
>>>>>> PCollection (pcollection.setSchema()).
>>>>>>
>>>>>> 2. From point 1 come two questions: how do we represent a schema? How
>>>>>> can we leverage the schema to simplify the serialization of the element
>>>>>> in the PCollection and the querying? These two questions are not
>>>>>> directly related.
>>>>>>
>>>>>> 2.1. How do we represent the schema
>>>>>> Json Schema is a very interesting idea. It could be an abstraction, and
>>>>>> other providers, like Avro, can be bound to it. It's part of the json
>>>>>> processing spec (javax).
>>>>>>
>>>>>> 2.2. How do we leverage the schema for query and serialization
>>>>>> Also in the spec, json pointer is interesting for the querying.
>>>>>> Regarding the serialization, jackson or another data binder can be used.
>>>>>>
>>>>>> These are still rough ideas in my mind, but I like Romain's idea about
>>>>>> json-p usage.
>>>>>>
>>>>>> Once the 2.3.0 release is out, I will start to update the document with
>>>>>> those ideas, and a PoC.
>>>>>>
>>>>>> Thanks !
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>>>> >
>>>>>> > On Jan 30, 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>>>> >
>>>>>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>>>>>> > <rmannibu...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi
>>>>>> >
>>>>>> > I have some questions on this: how would hierarchic schemas work? It
>>>>>> > seems this is not really supported by the ecosystem (outside of custom
>>>>>> > stuff) :(. How would it integrate smoothly with other generic record
>>>>>> > types - N bridges?
>>>>>> >
>>>>>> > Do you mean nested schemas? What do you mean here?
>>>>>> >
>>>>>> > Yes, sorry - wrote the mail too late ;). Was hierarchic data and
>>>>>> > nested schemas.
>>>>>> >
>>>>>> > Concretely I wonder if using the json API couldn't be beneficial:
>>>>>> > json-p is a nice generic abstraction with a built-in querying
>>>>>> > mechanism (jsonpointer) but no actual serialization (even if json and
>>>>>> > binary json are very natural). The big advantage is to have a
>>>>>> > well-known ecosystem - who doesn't know json today? - that beam can
>>>>>> > reuse for free: JsonObject (guess we don't want the JsonValue
>>>>>> > abstraction) for the record type, the jsonschema standard for the
>>>>>> > schema, jsonpointer for the selection/projection, etc. It doesn't
>>>>>> > enforce the actual serialization (json, smile, avro, ...) but provides
>>>>>> > an expressive and already known API, so I see it as a big win-win for
>>>>>> > users (no need to learn a new API and use N bridges in all ways) and
>>>>>> > beam (impls are here and the API design is already thought out).
>>>>>> >
>>>>>> > I assume you're talking about the API for setting schemas, not using
>>>>>> > them. Json has many downsides and I'm not sure it's true that everyone
>>>>>> > knows it; there are also competing schema APIs, such as Avro etc.
>>>>>> > However I think we should give Json a fair evaluation before
>>>>>> > dismissing it.
>>>>>> >
>>>>>> > It is a wider topic than schema. Actually schemas are not the
>>>>>> > first-class citizen; a generic data representation is. That is where
>>>>>> > json beats almost any other API. Then, when it comes to schema, json
>>>>>> > has a standard for that so we are all good.
>>>>>> >
>>>>>> > Also json has a good indexing API compared to alternatives, which are
>>>>>> > sometimes a bit faster - for noop transforms - but are hardly usable
>>>>>> > or make the code not that readable.
>>>>>> >
>>>>>> > Avro is a nice competitor and it is compatible - actually avro is
>>>>>> > json driven by design - but its API is far from being that easy due to
>>>>>> > its schema enforcement, which is heavy, and worse, you can't work with
>>>>>> > avro without a schema. Json would allow reconciling the dynamic and
>>>>>> > static cases, since the job wouldn't change except for setSchema.
>>>>>> >
>>>>>> > That is why I think json is a good compromise, and having a standard
>>>>>> > API for it allows fully customizing the impl at will if needed - even
>>>>>> > using avro or protobuf.
>>>>>> >
>>>>>> > Side note on the beam api: I don't think it is good to use a main API
>>>>>> > for runner optimization. It forces something to be shared across all
>>>>>> > runners without being widely usable. It is also misleading for users.
>>>>>> > Would you set a flink pipeline option with dataflow? My proposal here
>>>>>> > is to use hints - properties - instead of something rigidly defined in
>>>>>> > the API, then standardize it if all runners support it.
>>>>>> >
>>>>>> > Wdyt?
>>>>>> >
>>>>>> > On Jan 29, 2018 at 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net>
>>>>>> > wrote:
>>>>>> >
>>>>>> > Hi Reuven,
>>>>>> >
>>>>>> > Thanks for the update ! As I'm working with you on this, I fully
>>>>>> > agree, and it's a great doc gathering the ideas.
>>>>>> >
>>>>>> > It's clearly something we have to add asap in Beam, because it would
>>>>>> > allow new use cases for our users (in a simple way) and open new areas
>>>>>> > for the runners (for instance dataframe support in the Spark runner).
>>>>>> >
>>>>>> > By the way, a while ago, I created BEAM-3437 to track the PoC/PR
>>>>>> > around this.
>>>>>> >
>>>>>> > Thanks !
>>>>>> >
>>>>>> > Regards
>>>>>> > JB
>>>>>> >
>>>>>> > On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>>> > > Previously I submitted a proposal for adding schemas as a
>>>>>> > > first-class concept on Beam PCollections. The proposal engendered
>>>>>> > > quite a bit of discussion from the community - more discussion than
>>>>>> > > I've seen from almost any of our proposals to date!
>>>>>> > >
>>>>>> > > Based on the feedback and comments, I reworked the proposal
>>>>>> > > document quite a bit. It now talks more explicitly about the
>>>>>> > > difference between dynamic schemas (where the schema is not fully
>>>>>> > > known at graph-creation time) and static schemas (which are fully
>>>>>> > > known at graph-creation time). Proposed APIs are more fleshed out
>>>>>> > > now (again thanks to feedback from community members), and the
>>>>>> > > document talks in more detail about evolving schemas in
>>>>>> > > long-running streaming pipelines.
>>>>>> > >
>>>>>> > > Please take a look.
>>>>>> > > I think this will be very valuable to Beam, and welcome any
>>>>>> > > feedback.
>>>>>> > >
>>>>>> > > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>>> > >
>>>>>> > > Reuven
>>>>>> >
>>>>>> > --
>>>>>> > Jean-Baptiste Onofré
>>>>>> > jbono...@apache.org
>>>>>> > http://blog.nanthrax.net
>>>>>> > Talend - http://www.talend.com
>>>>>>
>>>>>> --
>>>>>> Jean-Baptiste Onofré
>>>>>> jbono...@apache.org
>>>>>> http://blog.nanthrax.net
>>>>>> Talend - http://www.talend.com
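
For anyone following the thread, here is a minimal sketch of two ideas discussed in it: a schema as optional, unenforced metadata attached to a collection (the spirit of pc.apply(SetSchema.of(schema))), and pointer-style access to nested generic records. Everything in it is an assumption made for illustration: SchemaSketch, Schema, Records, withSchema and resolve are hypothetical names, not Beam or JSON-P APIs. Plain java.util maps stand in for JsonObject, and resolve handles only a simplified subset of JSON Pointer (object members, no array indices).

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: none of these types exist in Beam or JSON-P.
final class SchemaSketch {

    /** Hypothetical schema: field name to type name, in declaration order. */
    static final class Schema {
        final Map<String, String> fields;

        Schema(Map<String, String> fields) {
            this.fields = Collections.unmodifiableMap(new LinkedHashMap<>(fields));
        }
    }

    /** Stand-in for a PCollection of generic records with optional schema metadata. */
    static final class Records {
        final List<Map<String, Object>> elements;
        final Optional<Schema> schema;

        private Records(List<Map<String, Object>> elements, Optional<Schema> schema) {
            this.elements = elements;
            this.schema = schema;
        }

        /** A collection without a schema: nothing is enforced on the elements. */
        static Records of(List<Map<String, Object>> elements) {
            return new Records(elements, Optional.empty());
        }

        /** Attaches a schema as metadata; the elements themselves are untouched. */
        Records withSchema(Schema newSchema) {
            return new Records(elements, Optional.of(newSchema));
        }
    }

    /**
     * Resolves a JSON-Pointer-style path such as "/address/city" against
     * nested maps. Returns null when any segment is missing.
     */
    static Object resolve(Map<String, Object> record, String pointer) {
        if (pointer.isEmpty()) {
            return record; // the empty pointer refers to the whole record
        }
        if (!pointer.startsWith("/")) {
            throw new IllegalArgumentException("pointer must be empty or start with '/'");
        }
        Object current = record;
        for (String raw : pointer.substring(1).split("/", -1)) {
            if (!(current instanceof Map)) {
                return null;
            }
            // Unescape "~1" before "~0", as JSON Pointer requires.
            String key = raw.replace("~1", "/").replace("~0", "~");
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Paris");
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("name", "Ada");
        user.put("address", address);

        Records untyped = Records.of(List.of(user));

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "string");
        fields.put("address", "record");
        Records typed = untyped.withSchema(new Schema(fields));

        System.out.println(untyped.schema.isPresent());          // false
        System.out.println(typed.schema.get().fields.keySet());  // [name, address]
        System.out.println(resolve(user, "/address/city"));      // Paris
    }
}
```

Attaching the schema returns a new wrapper over the same elements, which matches the thread's point that a schema should describe a collection without being enforced on it. The same job code can run with or without the schema, which is roughly the dynamic versus static reconciliation argued for in the thread.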