Re: Schema-Aware PCollections revisited

Romain Manni-Bucau Mon, 29 Jan 2018 12:17:29 -0800

Hi

I have some questions on this: how would hierarchical schemas work? It seems
they are not really supported by the ecosystem (outside of custom stuff) :(.
How would this integrate smoothly with other generic record types - with N
bridges?

Concretely, I wonder if using the JSON API couldn't be beneficial: JSON-P is
a nice generic abstraction with a built-in querying mechanism (jsonpointer)
but no actual serialization (even if JSON and binary JSON are very
natural). The big advantage is having a well-known ecosystem - who doesn't
know JSON today? - that Beam can reuse for free: JsonObject (I guess we
don't want the JsonValue abstraction) for the record type, the jsonschema
standard for the schema, jsonpointer for the selection/projection, etc. It
doesn't enforce the actual serialization (JSON, Smile, Avro, ...) but
provides an expressive and already known API, so I see it as a big win-win
for users (no need to learn a new API and use N bridges in all directions)
and for Beam (the implementations are there and the API design is already
thought through).
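
To make the selection/projection idea concrete, here is a minimal sketch of
JSON Pointer (RFC 6901) resolution over a nested record. It is written in
Python with only the standard library rather than against JSON-P itself, and
the names resolve_pointer, project and the sample record are hypothetical,
chosen purely for illustration. It is not a proposal for the Beam API.

```python
import json


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer against parsed JSON data.

    Hypothetical helper for illustration; not part of Beam or JSON-P.
    """
    if pointer == "":
        # The empty pointer refers to the whole document.
        return document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = document
    for raw_token in pointer.split("/")[1:]:
        # Unescape in the order RFC 6901 requires: "~1" first, then "~0".
        token = raw_token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]  # array index
        elif isinstance(current, dict):
            current = current[token]  # object member
        else:
            raise KeyError("cannot descend into a scalar at %r" % raw_token)
    return current


def project(record, pointers):
    """Project a nested record onto the fields named by the pointers."""
    return {p: resolve_pointer(record, p) for p in pointers}


# Sample hierarchical record (made up for this example).
record = json.loads(
    '{"user": {"name": "ada", "address": {"city": "Paris"}},'
    ' "tags": ["a", "b"]}'
)

print(project(record, ["/user/address/city", "/tags/1"]))
```

The point of the sketch is that a nested field is addressed by a plain
string, so a projection over a hierarchical record needs no bridge type of
its own.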

Wdyt?

On 29 Jan 2018 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

> Hi Reuven,
>
> Thanks for the update! As I'm working with you on this, I fully agree, and
> it's a great doc gathering the ideas.
>
> It's clearly something we have to add asap in Beam, because it would allow
> new use cases for our users (in a simple way) and open new areas for the
> runners (for instance dataframe support in the Spark runner).
>
> By the way, a while ago, I created BEAM-3437 to track the PoC/PR around this.
>
> Thanks !
>
> Regards
> JB
>
> On 01/29/2018 02:08 AM, Reuven Lax wrote:
> > Previously I submitted a proposal for adding schemas as a first-class
> > concept on Beam PCollections. The proposal engendered quite a bit of
> > discussion from the community - more discussion than I've seen from
> > almost any of our proposals to date!
> >
> > Based on the feedback and comments, I reworked the proposal document
> > quite a bit. It now talks more explicitly about the difference between
> > dynamic schemas (where the schema is not fully known at graph-creation
> > time) and static schemas (which are fully known at graph-creation
> > time). Proposed APIs are more fleshed out now (again thanks to feedback
> > from community members), and the document talks in more detail about
> > evolving schemas in long-running streaming pipelines.
> >
> > Please take a look. I think this will be very valuable to Beam, and
> > welcome any feedback.
> >
> > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
> >
> > Reuven
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
