On Jan 30, 2018, at 01:09, "Reuven Lax" <re...@google.com> wrote:
On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Hi
>
> I have some questions on this: how would hierarchic schemas work? It seems
> that is not really supported by the ecosystem (outside of custom stuff) :(.
> How would it integrate smoothly with other generic record types - N bridges?

Do you mean nested schemas? What do you mean here?

Yes, sorry - wrote the mail too late ;). I meant hierarchic data and nested schemas.

> Concretely I wonder if using a JSON API couldn't be beneficial: JSON-P is a
> nice generic abstraction with a built-in querying mechanism (JSON Pointer)
> but no actual serialization (even if JSON and binary JSON are very natural).
> The big advantage is to have a well-known ecosystem - who doesn't know JSON
> today? - that Beam can reuse for free: JsonObject (I guess we don't want the
> JsonValue abstraction) for the record type, the JSON Schema standard for the
> schema, JSON Pointer for the selection/projection, etc. It doesn't enforce
> the actual serialization (JSON, Smile, Avro, ...) but provides an expressive
> and already known API, so I see it as a big win-win for users (no need to
> learn a new API and use N bridges in all directions) and for Beam (the impls
> are already there and the API design is already thought out).

I assume you're talking about the API for setting schemas, not using them. JSON has many downsides and I'm not sure it's true that everyone knows it; there are also competing schema APIs, such as Avro etc. However, I think we should give JSON a fair evaluation before dismissing it.

It is a wider topic than schemas. Actually the schema is not the first-class citizen here - a generic data representation is. That is where JSON beats almost any other API. Then, when it comes to the schema, JSON has a standard for that, so we are all good. Also, JSON has a good indexing API compared to alternatives which are sometimes a bit faster - for no-op transforms - but are hardly usable or make the code less readable.
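To make the "built-in querying mechanism" point concrete: JSON Pointer (RFC 6901) resolves a path like `/user/emails/1` against any JSON document. A minimal sketch of the resolution algorithm follows - plain Python over parsed JSON rather than the javax.json `JsonPointer` API, and `resolve_pointer` is an illustrative name, not part of any library:

```python
import json

def resolve_pointer(doc, pointer):
    """Resolve an RFC 6901 JSON Pointer against a parsed JSON document."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("non-empty pointer must start with '/'")
    for token in pointer[1:].split("/"):
        # Unescape per RFC 6901: first '~1' -> '/', then '~0' -> '~'
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(doc, list):
            doc = doc[int(token)]  # array reference tokens are indices
        else:
            doc = doc[token]       # object reference tokens are member names
    return doc

record = json.loads('{"user": {"name": "jb", "emails": ["a@x", "b@y"]}}')
assert resolve_pointer(record, "/user/name") == "jb"
assert resolve_pointer(record, "/user/emails/1") == "b@y"
```

The same pointer strings could serve as field-selection/projection expressions on schema'd records, which is the reuse being argued for here.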
Avro is a nice competitor - and it is compatible; actually Avro is JSON-driven by design - but its API is far from being that easy because of its schema enforcement, which is heavy, and worse, you can't work with Avro without a schema. JSON would allow reconciling the dynamic and static cases, since the job wouldn't change except for the setSchema. That is why I think JSON is a good compromise, and having a standard API for it allows fully customizing the impl at will if needed - even using Avro or protobuf.

Side note on the Beam API: I don't think it is good to use a main API for runner optimization. It forces something to be shared across all runners without being widely usable, and it is also misleading for users. Would you set a Flink pipeline option with Dataflow? My proposal here is to use hints - properties - instead of something defined in the core API, and then standardize it if all runners support it.

> Wdyt?
>
> On Jan 29, 2018, at 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>
>> Hi Reuven,
>>
>> Thanks for the update! As I'm working with you on this, I fully agree - and
>> great doc gathering the ideas.
>>
>> It's clearly something we have to add asap in Beam, because it would allow
>> new use cases for our users (in a simple way) and open new areas for the
>> runners (for instance dataframe support in the Spark runner).
>>
>> By the way, a while ago I created BEAM-3437 to track the PoC/PR around
>> this.
>>
>> Thanks!
>>
>> Regards
>> JB
>>
>> On 01/29/2018 02:08 AM, Reuven Lax wrote:
>> > Previously I submitted a proposal for adding schemas as a first-class
>> > concept on Beam PCollections. The proposal engendered quite a bit of
>> > discussion from the community - more discussion than I've seen from
>> > almost any of our proposals to date!
>> >
>> > Based on the feedback and comments, I reworked the proposal document
>> > quite a bit. It now talks more explicitly about the difference between
>> > dynamic schemas (where the schema is not fully known at graph-creation
>> > time) and static schemas (which are fully known at graph-creation time).
>> > The proposed APIs are more fleshed out now (again thanks to feedback
>> > from community members), and the document talks in more detail about
>> > evolving schemas in long-running streaming pipelines.
>> >
>> > Please take a look. I think this will be very valuable to Beam, and I
>> > welcome any feedback.
>> >
>> > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>> >
>> > Reuven
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
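Romain's argument above - that a JSON record representation lets the same pipeline code serve both the dynamic case (no schema attached) and the static case (a declared schema) - can be sketched with a toy validator. This covers only the `type`, `properties` and `required` keywords of JSON Schema, in Python rather than the Java JSON-P API, and `validate` is an illustrative helper, not part of any Beam or JSON-P interface:

```python
import json

def validate(value, schema):
    """Check a parsed JSON value against a tiny subset of JSON Schema:
    only the 'type', 'required' and 'properties' keywords. A minimal
    sketch - real JSON Schema defines many more keywords."""
    type_map = {"object": dict, "array": list, "string": str,
                "number": (int, float), "boolean": bool}
    t = schema.get("type")
    if t is not None and not isinstance(value, type_map[t]):
        return False
    for name in schema.get("required", []):
        if name not in value:
            return False
    for name, sub in schema.get("properties", {}).items():
        if name in value and not validate(value[name], sub):
            return False
    return True

record = json.loads('{"name": "jb", "age": 42}')

# Dynamic case: no schema attached, the record is usable as-is.
assert record["name"] == "jb"

# Static case: the same record, now checked against a declared schema.
# Only the schema declaration changes; the record-access code does not.
schema = {"type": "object", "required": ["name"],
          "properties": {"name": {"type": "string"},
                         "age": {"type": "number"}}}
assert validate(record, schema)
```

This is the contrast with Avro being drawn in the thread: with a generic JSON record, attaching a schema is an optional extra step rather than a precondition for constructing the record at all.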