Yeah, all schemas are verified when the pipeline is construct (before anything starts running). BTW - under the covers schemas are implemented as a special type of coder, and coders are always set on a PCollection.
I'm happy to add explicit conversion transforms as well for Beam users, though as I mentioned generic transforms and frameworks like SQL will probably not find it convenient to use them. On Tue, May 22, 2018 at 11:02 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote: > > > Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré <j...@nanthrax.net> a > écrit : > >> Hi, >> >> IMHO, it would be better to have a explicit transform/IO as converter. >> >> It would be easier for users. >> >> Another option would be to use a "TypeConverter/SchemaConverter" map as >> we do in Camel: Beam could check the source/destination "type" and check >> in the map if there's a converter available. This map can be store as >> part of the pipeline (as we do for filesystem registration). >> > > > It works in camel because it is not strongly typed, isnt it? So can > require a beam new pipeline api. > > +1 for the explicit transform, if added to the pipeline api as coder it > wouldnt break the fluent api: > > p.apply(io).setOutputType(Foo.class) > > Coders can be a workaround since they owns the type but since the > pcollection is the real owner it is surely saner this way, no? > > Also it needs to ensure all converters are present before running the > pipeline probably, no implicit environment converter support is probably > good to start to avoid late surprises. > > > >> My $0.01 >> >> Regards >> JB >> >> On 23/05/2018 07:51, Romain Manni-Bucau wrote: >> > How does it work on the pipeline side? >> > Do you generate these "virtual" IO at build time to enable the fluent >> > API to work not erasing generics? >> > >> > ex: SQL(row)->BigQuery(native) will not compile so we need a >> > SQL(row)->BigQuery(row) >> > >> > Side note unrelated to Row: if you add another registry maybe a pretask >> > is to ensure beam has a kind of singleton/context to avoid to duplicate >> > it or not track it properly. These kind of converters will need a global >> > close and not only per record in general: >> > converter.init();converter.convert(row);....converter.destroy();, >> > otherwise it easily leaks. This is why it can require some way to not >> > recreate it. A quick fix, if you are in bytebuddy already, can be to add >> > it to setup/teardown pby, being more global would be nicer but is more >> > challenging. >> > >> > Romain Manni-Bucau >> > @rmannibucau <https://twitter.com/rmannibucau> | Blog >> > <https://rmannibucau.metawerx.net/> | Old Blog >> > <http://rmannibucau.wordpress.com> | Github >> > <https://github.com/rmannibucau> | LinkedIn >> > <https://www.linkedin.com/in/rmannibucau> | Book >> > < >> https://www.packtpub.com/application-development/java-ee-8-high-performance >> > >> > >> > >> > Le mer. 23 mai 2018 à 07:22, Reuven Lax <re...@google.com >> > <mailto:re...@google.com>> a écrit : >> > >> > No - the only modules we need to add to core are the ones we choose >> > to add. For example, I will probably add a registration for >> > TableRow/TableSchema (GCP BigQuery) so these can work seamlessly >> > with schemas. However I will add that to the GCP module, so only >> > someone depending on that module need to pull in that dependency. >> > The Java ServiceLoader framework can be used by these modules to >> > register schemas for their types (we already do something similar >> > for FileSystem and for coders as well). >> > >> > BTW, right now the conversion back and forth between Row objects I'm >> > doing in the ByteBuddy generated bytecode that we generate in order >> > to invoke DoFns. >> > >> > Reuven >> > >> > On Tue, May 22, 2018 at 10:04 PM Romain Manni-Bucau >> > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote: >> > >> > Hmm, the pluggability part is close to what I wanted to do with >> > JsonObject as a main API (to avoid to redo a "row" API and >> > schema API) >> > Row.as(Class<T>) sounds good but then, does it mean we'll get >> > beam-sdk-java-row-jsonobject like modules (I'm not against, just >> > trying to understand here)? >> > If so, how an IO can use as() with the type it expects? Doesnt >> > it lead to have a tons of these modules at the end? >> > >> > Romain Manni-Bucau >> > @rmannibucau <https://twitter.com/rmannibucau> | Blog >> > <https://rmannibucau.metawerx.net/> | Old Blog >> > <http://rmannibucau.wordpress.com> | Github >> > <https://github.com/rmannibucau> | LinkedIn >> > <https://www.linkedin.com/in/rmannibucau> | Book >> > < >> https://www.packtpub.com/application-development/java-ee-8-high-performance >> > >> > >> > >> > Le mer. 23 mai 2018 à 04:57, Reuven Lax <re...@google.com >> > <mailto:re...@google.com>> a écrit : >> > >> > By the way Romain, if you have specific scenarios in mind I >> > would love to hear them. I can try and guess what exactly >> > you would like to get out of schemas, but it would work >> > better if you gave me concrete scenarios that you would like >> > to work. >> > >> > Reuven >> > >> > On Tue, May 22, 2018 at 7:45 PM Reuven Lax < >> re...@google.com >> > <mailto:re...@google.com>> wrote: >> > >> > Yeah, what I'm working on will help with IO. Basically >> > if you register a function with SchemaRegistry that >> > converts back and forth between a type (say JsonObject) >> > and a Beam Row, then it is applied by the framework >> > behind the scenes as part of DoFn invocation. Concrete >> > example: let's say I have an IO that reads json objects >> > class MyJsonIORead extends PTransform<PBegin, >> > JsonObject> {...} >> > >> > If you register a schema for this type (or you can also >> > just set the schema directly on the output PCollection), >> > then Beam knows how to convert back and forth between >> > JsonObject and Row. So the next ParDo can look like >> > >> > p.apply(new MyJsonIORead()) >> > .apply(ParDo.of(new DoFn<JsonObject, T>.... >> > @ProcessElement void process(@Element Row row) { >> > }) >> > >> > And Beam will automatically convert JsonObject to a Row >> > for processing (you aren't forced to do this of course - >> > you can always ask for it as a JsonObject). >> > >> > The same is true for output. If you have a sink that >> > takes in JsonObject but the transform before it produces >> > Row objects (for instance - because the transform before >> > it is Beam SQL), Beam can automatically convert Row back >> > to JsonObject for you. >> > >> > All of this was detailed in the Schema doc I shared a >> > few months ago. There was a lot of discussion on that >> > document from various parties, and some of this API is a >> > result of that discussion. This is also working in the >> > branch JB and I were working on, though not yet >> > integrated back to master. >> > >> > I would like to actually go further and make Row an >> > interface and provide a way to automatically put a Row >> > interface on top of any other object (e.g. JsonObject, >> > Pojo, etc.) This won't change the way the user writes >> > code, but instead of Beam having to copy and convert at >> > each stage (e.g. from JsonObject to Row) it simply will >> > create a Row object that uses the the JsonObject as its >> > underlying storage. >> > >> > Reuven >> > >> > On Tue, May 22, 2018 at 11:37 AM Romain Manni-Bucau >> > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> >> > wrote: >> > >> > Well, beam can implement a new mapper but it doesnt >> > help for io. Most of modern backends will take json >> > directly, even javax one and it must stay generic. >> > >> > Then since json to pojo mapping is already done a >> > dozen of times, not sure it is worth it for now. >> > >> > Le mar. 22 mai 2018 20:27, Reuven Lax >> > <re...@google.com <mailto:re...@google.com>> a >> écrit : >> > >> > We can do even better btw. Building a >> > SchemaRegistry where automatic conversions can >> > be registered between schema and Java data >> > types. With this the user won't even need a DoFn >> > to do the conversion. >> > >> > On Tue, May 22, 2018, 10:13 AM Romain >> > Manni-Bucau <rmannibu...@gmail.com >> > <mailto:rmannibu...@gmail.com>> wrote: >> > >> > Hi guys, >> > >> > Checked out what has been done on schema >> > model and think it is acceptable - regarding >> > the json debate - >> > if >> https://issues.apache.org/jira/browse/BEAM-4381 >> > can be fixed. >> > >> > High level, it is about providing a >> > mainstream and not too impacting model OOTB >> > and JSON seems the most valid option for >> > now, at least for IO and some user >> transforms. >> > >> > Wdyt? >> > >> > Le ven. 27 avr. 2018 18:36, Romain >> > Manni-Bucau <rmannibu...@gmail.com >> > <mailto:rmannibu...@gmail.com>> a écrit : >> > >> > Can give it a try end of may, sure. >> > (holidays and work constraints will make >> > it hard before). >> > >> > Le 27 avr. 2018 18:26, "Anton Kedin" >> > <ke...@google.com >> > <mailto:ke...@google.com>> a écrit : >> > >> > Romain, >> > >> > I don't believe that JSON approach >> > was investigated very thoroughIy. I >> > mentioned few reasons which will >> > make it not the best choice my >> > opinion, but I may be wrong. Can you >> > put together a design doc or a >> > prototype? >> > >> > Thank you, >> > Anton >> > >> > >> > On Thu, Apr 26, 2018 at 10:17 PM >> > Romain Manni-Bucau >> > <rmannibu...@gmail.com >> > <mailto:rmannibu...@gmail.com>> >> wrote: >> > >> > >> > >> > Le 26 avr. 2018 23:13, "Anton >> > Kedin" <ke...@google.com >> > <mailto:ke...@google.com>> a >> écrit : >> > >> > BeamRecord (Row) has very >> > little in common with >> > JsonObject (I assume you're >> > talking about javax.json), >> > except maybe some >> > similarities of the API. Few >> > reasons why JsonObject >> > doesn't work: >> > >> > * it is a Java EE API: >> > o Beam SDK is not >> > limited to Java. >> > There are probably >> > similar APIs for >> > other languages but >> > they might not >> > necessarily carry >> > the same semantics / >> > APIs; >> > >> > >> > Not a big deal I think. At least >> > not a technical blocker. >> > >> > o It can change >> > between Java >> versions; >> > >> > No, this is javaee ;). >> > >> > >> > o Current Beam java >> > implementation is an >> > experimental feature >> > to identify what's >> > needed from such >> > API, in the end we >> > might end up with >> > something similar to >> > JsonObject API, but >> > likely not >> > >> > >> > I dont get that point as a >> blocker >> > >> > o ; >> > * represents JSON, which >> > is not an API but an >> > object notation: >> > o it is defined as >> > unicode string in a >> > certain format. If >> > you choose to adhere >> > to ECMA-404, then it >> > doesn't sound like >> > JsonObject can >> > represent an Avro >> > object, if I'm >> > reading it right; >> > >> > >> > It is in the generator impl, you >> > can impl an avrogenerator. >> > >> > * doesn't define a type >> > system (JSON does, but >> > it's lacking): >> > o for example, JSON >> > doesn't define >> > semantics for >> numbers; >> > o doesn't define >> > date/time types; >> > o doesn't allow >> > extending JSON type >> > system at all; >> > >> > >> > That is why you need a metada >> > object, or simpler, a schema >> > with that data. Json or beam >> > record doesnt help here and you >> > end up on the same outcome if >> > you think about it. >> > >> > * lacks schemas; >> > >> > Jsonschema are standard, widely >> > spread and tooled compared to >> > alternative. >> > >> > You can definitely try >> > loosen the requirements and >> > define everything in JSON in >> > userland, but the point of >> > Row/Schema is to avoid it >> > and define everything in >> > Beam model, which can be >> > extended, mapped to JSON, >> > Avro, BigQuery Schemas, >> > custom binary format etc., >> > with same semantics across >> > beam SDKs. >> > >> > >> > This is what jsonp would allow >> > with the benefit of a natural >> > pojo support through jsonb. >> > >> > >> > >> > On Thu, Apr 26, 2018 at >> > 12:28 PM Romain Manni-Bucau >> > <rmannibu...@gmail.com >> > <mailto: >> rmannibu...@gmail.com>> >> > wrote: >> > >> > Just to let it be clear >> > and let me understand: >> > how is BeamRecord >> > different from a >> > JsonObject which is an >> > API without >> > implementation (not >> > event a json one OOTB)? >> > Advantage of json *api* >> > are indeed natural >> > mapping (jsonb is based >> > on jsonp so no new >> > binding to reinvent) and >> > simple serialization >> > (json+gzip for ex, or >> > avro if you want to be >> > geeky). >> > >> > I fail to see the point >> > to rebuild an ecosystem >> ATM. >> > >> > Le 26 avr. 2018 19:12, >> > "Reuven Lax" >> > <re...@google.com >> > <mailto: >> re...@google.com>> >> > a écrit : >> > >> > Exactly what JB >> > said. We will write >> > a generic conversion >> > from Avro (or json) >> > to Beam schemas, >> > which will make them >> > work transparently >> > with SQL. The plan >> > is also to migrate >> > Anton's work so that >> > POJOs works >> > generically for any >> > schema. >> > >> > Reuven >> > >> > On Thu, Apr 26, 2018 >> > at 1:17 AM >> > Jean-Baptiste Onofré >> > <j...@nanthrax.net >> > <mailto: >> j...@nanthrax.net>> >> > wrote: >> > >> > For now we have >> > a generic schema >> > interface. >> > Json-b can be an >> > impl, avro could >> > be another one. >> > >> > Regards >> > JB >> > Le 26 avr. 2018, >> > à 12:08, Romain >> > Manni-Bucau >> > < >> rmannibu...@gmail.com >> > <mailto: >> rmannibu...@gmail.com>> >> > a écrit: >> > >> > Hmm, >> > >> > avro has >> > still the >> > pitfalls to >> > have an >> > uncontrolled >> > stack which >> > brings way >> > too much >> > dependencies >> > to be part >> > of any API, >> > this is why >> > I proposed a >> > JSON-P based >> > API >> > (JsonObject) >> > with a >> > custom beam >> > entry for >> > some >> > metadata >> > (headers "à >> > la Camel"). >> > >> > >> > Romain >> > Manni-Bucau >> > @rmannibucau >> > < >> https://twitter.com/rmannibucau> >> > | Blog >> > < >> https://rmannibucau.metawerx.net/> | >> > Old Blog >> > < >> http://rmannibucau.wordpress.com> >> > | Github >> > < >> https://github.com/rmannibucau> | >> > LinkedIn >> > < >> https://www.linkedin.com/in/rmannibucau> | >> > Book >> > < >> https://www.packtpub.com/application-development/java-ee-8-high-performance >> > >> > >> > >> > 2018-04-26 >> > 9:59 >> > GMT+02:00 >> > >> Jean-Baptiste Onofré >> > < >> j...@nanthrax.net >> > <mailto: >> j...@nanthrax.net>>: >> > >> > >> > Hi >> Ismael >> > >> > You mean >> > directly >> > in Beam >> > SQL ? >> > >> > That >> > will be >> > part of >> > schema >> > support: >> > generic >> > record >> > could be >> > one of >> > the >> > payload >> > with >> > across >> > schema. >> > >> > Regards >> > JB >> > Le 26 >> > avr. >> > 2018, à >> > 11:39, >> > "Ismaël >> > Mejía" < >> > >> ieme...@gmail.com >> > <mailto: >> ieme...@gmail.com>> >> > a écrit: >> > >> > >> Hello Anton, >> > >> > >> Thanks for the descriptive email and the really useful work. Any plans >> > to >> tackle PCollections of GenericRecord/IndexedRecords? it seems Avro >> > is >> a natural fit for this approach too. >> > >> > >> Regards, >> > >> Ismaël >> > >> > On >> Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com >> > >> <mailto:ke...@google.com>> wrote: >> > >> > >> >> > >> > >> Hi, >> > >> > >> I want >> > >> to >> > >> highlight >> > >> a couple >> > >> of >> > >> improvements >> > >> to >> > >> Beam >> > >> SQL >> > >> we >> > >> have >> > >> been >> > >> > >> working >> > >> on >> > >> recently >> > >> which >> > >> are >> > >> targeted >> > >> to >> > >> make >> > >> Beam >> > >> SQL >> > >> API >> > >> easier >> > >> to >> > >> use. >> > >> > >> Specifically >> > >> these >> > >> features >> > >> simplify >> > >> conversion >> > >> of >> > >> Java >> > >> Beans >> > >> and >> > >> JSON >> > >> > >> strings >> > >> to >> > >> Rows. >> > >> > >> > >> Feel >> > >> free >> > >> to >> > >> try >> > >> this >> > >> and >> > >> send >> > >> any >> > >> bugs/comments/PRs >> > >> my >> > >> way. >> > >> > >> > >> **Caveat: >> > >> this >> > >> is >> > >> still >> > >> work >> > >> in >> > >> progress, >> > >> and >> > >> has >> > >> known >> > >> bugs >> > >> and >> > >> incomplete >> > >> > >> features, >> > >> see >> > >> below >> > >> for >> > >> details.** >> > >> > >> > >> Background >> > >> > >> > >> Beam >> > >> SQL >> > >> queries >> > >> can >> > >> only >> > >> be >> > >> applied >> > >> to >> > >> PCollection<Row>. >> > >> This >> > >> means >> > >> that >> > >> > >> users >> > >> need >> > >> to >> > >> convert >> > >> whatever >> > >> PCollection >> > >> elements >> > >> they >> > >> have >> > >> to >> > >> Rows >> > >> before >> > >> > >> querying >> > >> them >> > >> with >> > >> SQL. >> > >> This >> > >> usually >> > >> requires >> > > >