On Jan 31, 2018 at 20:16, "Reuven Lax" <re...@google.com> wrote:

As to the question of how a schema should be specified, I want to support
several common schema formats. So if a user has a Json schema, or an Avro
schema, or a Calcite schema, etc., there should be adapters that allow
setting a schema from any of them. I don't think we should prefer one over
the others. While Romain is right that many people know Json, I think far
fewer people know Json schemas.
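
A minimal sketch of what such an adapter layer could look like (BeamSchema,
SchemaAdapter and AvroSchemaAdapter are hypothetical names for illustration,
not existing Beam API):

    import org.apache.avro.Schema;

    // Placeholder for whatever Beam's schema type ends up being (hypothetical).
    final class BeamSchema {}

    // Hypothetical adapter contract: build a Beam schema from a provider's one.
    interface SchemaAdapter<T> {
      BeamSchema toBeamSchema(T providerSchema);
    }

    // An Avro adapter; Json Schema or Calcite adapters would mirror this shape.
    final class AvroSchemaAdapter implements SchemaAdapter<Schema> {
      @Override
      public BeamSchema toBeamSchema(Schema avroSchema) {
        // Walk avroSchema.getFields() and map each Avro type to a Beam type.
        return new BeamSchema();
      }
    }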


Agreed, but the schema would get an API for Beam usage - I don't think there
is a standard we can use, and we can't use any vendor-specific API in Beam -
so not a big deal IMO / not a blocker.



Agree, schemas should not be enforced (for one thing, that wouldn't be
backwards compatible!). I think for the initial prototype I will probably
use a special coder to represent the schema (with setSchema as an option on
the coder), largely because it doesn't require modifying PCollection.
However, I think that longer term a schema should be an optional piece of
metadata on the PCollection object. Similar to the previous discussion
about "hints," I think this can be set on the producing PTransform, and a
SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I think
a schema should be similar to hints: it's just another piece of metadata on
the PCollection (though something interpreted by the model, where hints are
interpreted by the runner).
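
A rough sketch of the two stages just described (Row, ToRowFn, SchemaCoder
and SetSchema are proposed or hypothetical names from this thread, not
existing API):

    // Prototype stage: the schema rides along on a special coder.
    PCollection<Row> rows = input.apply(ParDo.of(new ToRowFn()));
    rows.setCoder(SchemaCoder.of(schema));

    // Longer term: the schema becomes metadata on the PCollection itself,
    // attached by the producing PTransform or explicitly:
    PCollection<Row> withSchema = rows.apply(SetSchema.of(schema));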


The schema should probably be contributable from the transform when it is
mandatory - thinking of Avro IO here - or a hint as a fallback when optional.
This sounds good to me and doesn't require any public API other than hints.


Reuven

On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi,
>
> I think we should avoid mixing two things in the discussion (and so in the
> document):
>
> 1. The element of the collection and the schema itself are two different
> things.
> In essence, Beam should not enforce any schema. That's why I think it's a
> good idea to set the schema optionally on the PCollection
> (pcollection.setSchema()).
>
> 2. From point 1 come two questions: how do we represent a schema? How can
> we leverage the schema to simplify the serialization of the elements in
> the PCollection and the querying? These two questions are not directly
> related.
>
>  2.1 How do we represent the schema
> Json Schema is a very interesting idea. It could be an abstraction that
> other providers, like Avro, can be bound to, and it fits naturally with
> the JSON Processing (javax.json) ecosystem.
>
>  2.2 How do we leverage the schema for querying and serialization
> Also in the spec, json pointer is interesting for the querying. Regarding
> serialization, Jackson or another data binder can be used.
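>
> For example, with the JSON-P 1.1 API (javax.json), a json pointer query
> needs no generated code at all (the record content here is invented):
>
>     import javax.json.Json;
>     import javax.json.JsonObject;
>     import javax.json.JsonPointer;
>
>     JsonObject record = Json.createObjectBuilder()
>         .add("user", Json.createObjectBuilder()
>             .add("name", "jb")
>             .add("age", 40))
>         .build();
>
>     // "/user/name" projects a nested field of the generic record.
>     JsonPointer pointer = Json.createPointer("/user/name");
>     System.out.println(pointer.getValue(record)); // prints "jb"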
>
> These are still rough ideas in my mind, but I like Romain's idea about
> using json-p.
>
> Once the 2.3.0 release is out, I will start updating the document with
> these ideas, and a PoC.
>
> Thanks!
> Regards
> JB
>
> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
> >
> >
> > On Jan 30, 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:
> >
> >
> >
> >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
> >     <rmannibu...@gmail.com> wrote:
> >
> >         Hi
> >
> >         I have some questions on this: how would hierarchic schemas
> >         work? It seems it is not really supported by the ecosystem
> >         (outside of custom stuff) :(. How would it integrate smoothly
> >         with other generic record types - N bridges?
> >
> >
> >     Do you mean nested schemas? What do you mean here?
> >
> >
> > Yes, sorry - I wrote the mail too late ;). I meant hierarchic data and
> > nested schemas.
> >
> >
> >         Concretely, I wonder if using the json API couldn't be
> >         beneficial: json-p is a nice generic abstraction with a built-in
> >         querying mechanism (jsonpointer) but no actual serialization
> >         (even if json and binary json are very natural). The big
> >         advantage is to have a well-known ecosystem - who doesn't know
> >         json today? - that beam can reuse for free: JsonObject (I guess
> >         we don't want the JsonValue abstraction) for the record type,
> >         the jsonschema standard for the schema, jsonpointer for the
> >         selection/projection, etc. It doesn't enforce the actual
> >         serialization (json, smile, avro, ...) but provides an
> >         expressive and already known API, so I see it as a big win-win
> >         for users (no need to learn a new API and use N bridges in all
> >         ways) and for beam (the implementations are there and the API
> >         design is already thought through).
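> >
> >         As a small illustration of JsonObject as the generic record
> >         type, covering the nested/hierarchic case raised above (the
> >         data is invented for the example):
> >
> >             import javax.json.Json;
> >             import javax.json.JsonObject;
> >
> >             JsonObject order = Json.createObjectBuilder()
> >                 .add("id", 42)
> >                 .add("items", Json.createArrayBuilder()
> >                     .add(Json.createObjectBuilder()
> >                         .add("sku", "A-1")
> >                         .add("qty", 2)))
> >                 .build();
> >
> >             // Nested access via the standard API, no generated classes.
> >             String sku = order.getJsonArray("items")
> >                 .getJsonObject(0)
> >                 .getString("sku");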
> >
> >
> >     I assume you're talking about the API for setting schemas, not
> >     using them. Json has many downsides, and I'm not sure it's true that
> >     everyone knows it; there are also competing schema APIs, such as
> >     Avro. However, I think we should give Json a fair evaluation before
> >     dismissing it.
> >
> >
> > It is a wider topic than schemas. Actually, schemas are not the
> > first-class citizen here; a generic data representation is. That is
> > where json beats almost any other API. Then, when it comes to schemas,
> > json has a standard for that, so we are all good.
> >
> > Also, json has a good indexing API compared to the alternatives, which
> > are sometimes a bit faster - for no-op transforms - but are hardly
> > usable or make the code not that readable.
> >
> > Avro is a nice competitor, and it is compatible - actually avro is
> > json-driven by design - but its API is far from easy due to its schema
> > enforcement, which is heavy, and worse, you can't work with avro without
> > a schema. Json would allow us to reconcile the dynamic and static cases,
> > since the job wouldn't change except for the setSchema.
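> >
> > To make the contrast concrete (the schema literal and record content
> > are just examples):
> >
> >     import javax.json.Json;
> >     import javax.json.JsonObject;
> >     import org.apache.avro.Schema;
> >     import org.apache.avro.generic.GenericData;
> >     import org.apache.avro.generic.GenericRecord;
> >
> >     // Avro: even one generic record needs a full schema up front.
> >     Schema schema = new Schema.Parser().parse(
> >         "{\"type\":\"record\",\"name\":\"User\","
> >         + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
> >     GenericRecord avroUser = new GenericData.Record(schema);
> >     avroUser.put("name", "romain");
> >
> >     // Json: the same record, usable before any schema is attached.
> >     JsonObject jsonUser = Json.createObjectBuilder()
> >         .add("name", "romain")
> >         .build();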
> >
> > That is why I think json is a good compromise: having a standard API
> > for it allows fully customizing the implementation at will if needed -
> > even using avro or protobuf.
> >
> > Side note on the beam API: I don't think it is good to use the main API
> > for runner optimizations. It forces something to be shared by all
> > runners while not being widely usable. It is also misleading for users:
> > would you set a flink pipeline option with dataflow? My proposal here is
> > to use hints - properties - instead of something hard-wired in the API,
> > and then standardize it if all runners support it.
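> >
> > As a sketch of what such a hint could look like (setHint is purely
> > hypothetical; nothing like it exists in the Beam API today):
> >
> >     // A plain key/value property: runners that understand it may act
> >     // on it, and all the others simply ignore it.
> >     collection.setHint("spark.dataframe", "true");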
> >
> >
> >
> >         Wdyt?
> >
> >         On Jan 29, 2018 at 06:24, "Jean-Baptiste Onofré"
> >         <j...@nanthrax.net> wrote:
> >
> >             Hi Reuven,
> >
> >             Thanks for the update! As I'm working with you on this, I
> >             fully agree - great doc gathering the ideas.
> >
> >             It's clearly something we have to add to Beam asap, because
> >             it would enable new use cases for our users (in a simple
> >             way) and open new areas for the runners (for instance,
> >             dataframe support in the Spark runner).
> >
> >             By the way, a while ago I created BEAM-3437 to track the
> >             PoC/PR around this.
> >
> >             Thanks!
> >
> >             Regards
> >             JB
> >
> >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
> >             > Previously I submitted a proposal for adding schemas as a
> >             > first-class concept on Beam PCollections. The proposal
> >             > engendered quite a bit of discussion from the community -
> >             > more discussion than I've seen from almost any of our
> >             > proposals to date!
> >             >
> >             > Based on the feedback and comments, I reworked the
> >             > proposal document quite a bit. It now talks more
> >             > explicitly about the difference between dynamic schemas
> >             > (where the schema is not fully known at graph-creation
> >             > time) and static schemas (which are fully known at
> >             > graph-creation time). Proposed APIs are more fleshed out
> >             > now (again thanks to feedback from community members),
> >             > and the document talks in more detail about evolving
> >             > schemas in long-running streaming pipelines.
> >             >
> >             > Please take a look. I think this will be very valuable to
> >             > Beam, and I welcome any feedback.
> >             >
> >             >
> >             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
> >             >
> >             > Reuven
> >
> >             --
> >             Jean-Baptiste Onofré
> >             jbono...@apache.org
> >             http://blog.nanthrax.net
> >             Talend - http://www.talend.com
> >
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
