On Jan 30, 2018, at 01:09, "Reuven Lax" <re...@google.com> wrote:
On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Hi
>
> I have some questions on this: how would hierarchic schemas work? It seems
> that is not really supported by the ecosystem (outside of custom stuff) :(.
> How would it integrate smoothly with other generic record types - N bridges?

Do you mean nested schemas? What do you mean here?

Yes, sorry - wrote the mail too late ;). I meant hierarchic data and nested schemas.

> Concretely I wonder if using a JSON API couldn't be beneficial: JSON-P is a
> nice generic abstraction with a built-in querying mechanism (JSON Pointer)
> but no actual serialization (even if JSON and binary JSON are very natural).
> The big advantage is to have a well-known ecosystem - who doesn't know JSON
> today? - that Beam can reuse for free: JsonObject (I guess we don't want the
> JsonValue abstraction) for the record type, the JSON Schema standard for the
> schema, JSON Pointer for the selection/projection, etc. It doesn't enforce
> the actual serialization (JSON, Smile, Avro, ...) but provides an expressive
> and already known API, so I see it as a big win-win for users (no need to
> learn a new API and use N bridges in all directions) and for Beam (the impls
> are already there and the API design is already thought out).

I assume you're talking about the API for setting schemas, not using them. JSON has many downsides and I'm not sure it's true that everyone knows it; there are also competing schema APIs, such as Avro etc. However, I think we should give JSON a fair evaluation before dismissing it.

It is a wider topic than schemas. Actually the schema is not the first-class citizen here - a generic data representation is. That is where JSON beats almost any other API. Then, when it comes to the schema, JSON has a standard for that, so we are all good. Also, JSON has a good indexing API compared to alternatives which are sometimes a bit faster - for no-op transforms - but are hardly usable or make the code less readable.
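To make the "built-in querying mechanism" point concrete: JSON Pointer (RFC 6901) resolves a path like `/user/emails/1` against any JSON document. A minimal sketch of the resolution algorithm follows - plain Python over parsed JSON rather than the javax.json `JsonPointer` API, and `resolve_pointer` is an illustrative name, not part of any library:

```python
import json

def resolve_pointer(doc, pointer):
    """Resolve an RFC 6901 JSON Pointer against a parsed JSON document."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("non-empty pointer must start with '/'")
    for token in pointer[1:].split("/"):
        # Unescape per RFC 6901: first '~1' -> '/', then '~0' -> '~'
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(doc, list):
            doc = doc[int(token)]  # array reference tokens are indices
        else:
            doc = doc[token]       # object reference tokens are member names
    return doc

record = json.loads('{"user": {"name": "jb", "emails": ["a@x", "b@y"]}}')
assert resolve_pointer(record, "/user/name") == "jb"
assert resolve_pointer(record, "/user/emails/1") == "b@y"
```

The same pointer strings could serve as field-selection/projection expressions on schema'd records, which is the reuse being argued for here.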
Avro is a nice competitor - and it is compatible; actually Avro is JSON-driven by design - but its API is far from being that easy because of its schema enforcement, which is heavy, and worse, you can't work with Avro without a schema. JSON would allow reconciling the dynamic and static cases, since the job wouldn't change except for the setSchema. That is why I think JSON is a good compromise, and having a standard API for it allows fully customizing the impl at will if needed - even using Avro or protobuf.

Side note on the Beam API: I don't think it is good to use a main API for runner optimization. It forces something to be shared across all runners without being widely usable, and it is also misleading for users. Would you set a Flink pipeline option with Dataflow? My proposal here is to use hints - properties - instead of something defined in the core API, and then standardize it if all runners support it.

> Wdyt?
>
> On Jan 29, 2018, at 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>
>> Hi Reuven,
>>
>> Thanks for the update! As I'm working with you on this, I fully agree - and
>> great doc gathering the ideas.
>>
>> It's clearly something we have to add asap in Beam, because it would allow
>> new use cases for our users (in a simple way) and open new areas for the
>> runners (for instance dataframe support in the Spark runner).
>>
>> By the way, a while ago I created BEAM-3437 to track the PoC/PR around
>> this.
>>
>> Thanks!
>>
>> Regards
>> JB
>>
>> On 01/29/2018 02:08 AM, Reuven Lax wrote:
>> > Previously I submitted a proposal for adding schemas as a first-class
>> > concept on Beam PCollections. The proposal engendered quite a bit of
>> > discussion from the community - more discussion than I've seen from
>> > almost any of our proposals to date!
>> >
>> > Based on the feedback and comments, I reworked the proposal document
>> > quite a bit. It now talks more explicitly about the difference between
>> > dynamic schemas (where the schema is not fully known at graph-creation
>> > time) and static schemas (which are fully known at graph-creation time).
>> > The proposed APIs are more fleshed out now (again thanks to feedback
>> > from community members), and the document talks in more detail about
>> > evolving schemas in long-running streaming pipelines.
>> >
>> > Please take a look. I think this will be very valuable to Beam, and I
>> > welcome any feedback.
>> >
>> > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>> >
>> > Reuven
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
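Romain's argument above - that a JSON record representation lets the same pipeline code serve both the dynamic case (no schema attached) and the static case (a declared schema) - can be sketched with a toy validator. This covers only the `type`, `properties` and `required` keywords of JSON Schema, in Python rather than the Java JSON-P API, and `validate` is an illustrative helper, not part of any Beam or JSON-P interface:

```python
import json

def validate(value, schema):
    """Check a parsed JSON value against a tiny subset of JSON Schema:
    only the 'type', 'required' and 'properties' keywords. A minimal
    sketch - real JSON Schema defines many more keywords."""
    type_map = {"object": dict, "array": list, "string": str,
                "number": (int, float), "boolean": bool}
    t = schema.get("type")
    if t is not None and not isinstance(value, type_map[t]):
        return False
    for name in schema.get("required", []):
        if name not in value:
            return False
    for name, sub in schema.get("properties", {}).items():
        if name in value and not validate(value[name], sub):
            return False
    return True

record = json.loads('{"name": "jb", "age": 42}')

# Dynamic case: no schema attached, the record is usable as-is.
assert record["name"] == "jb"

# Static case: the same record, now checked against a declared schema.
# Only the schema declaration changes; the record-access code does not.
schema = {"type": "object", "required": ["name"],
          "properties": {"name": {"type": "string"},
                         "age": {"type": "number"}}}
assert validate(record, schema)
```

This is the contrast with Avro being drawn in the thread: with a generic JSON record, attaching a schema is an optional extra step rather than a precondition for constructing the record at all.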