Hi all,

If there are no concerns, I would like to start working on a prototype. It's
just a prototype, so I don't think it will have the final API (e.g. for the
prototype I'm going to avoid changing the API of PCollection, and use a
"special" Coder instead). Also, even once we go beyond the prototype, it will
be @Experimental for some time, so the API will not be set in stone.
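To make the "special Coder" idea concrete, here is a minimal, self-contained sketch. The names `Schema` and `SchemaCoder` are hypothetical, not actual Beam API: the point is only that a coder can carry a schema object the model can inspect, while element encoding is delegated as usual, so PCollection itself stays untouched.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not real Beam API: a coder that carries a schema
// alongside the encoding it wraps, so a schema can ride on a PCollection
// without changing the PCollection API itself.
public class SchemaCoderSketch {

    /** A deliberately tiny schema: ordered field name -> type name. */
    public static final class Schema {
        final Map<String, String> fields = new LinkedHashMap<>();

        Schema field(String name, String type) {
            fields.put(name, type);
            return this;
        }
    }

    /** Coder-like wrapper: bytes are delegated, the schema rides along. */
    public static final class SchemaCoder {
        private final Schema schema;

        SchemaCoder(Schema schema) {
            this.schema = schema;
        }

        /** The model (not the runner) can ask the coder for its schema. */
        Schema getSchema() {
            return schema;
        }

        /** Stand-in for real encoding; a real coder would delegate here. */
        byte[] encode(String element) {
            return element.getBytes(StandardCharsets.UTF_8);
        }

        String decode(byte[] bytes) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        Schema schema = new Schema().field("userId", "INT64").field("country", "STRING");
        SchemaCoder coder = new SchemaCoder(schema);
        // Round-trip still works, and the schema is discoverable.
        if (!coder.decode(coder.encode("hello")).equals("hello")) {
            throw new AssertionError();
        }
        System.out.println(coder.getSchema().fields.keySet());
    }
}
```

The design trade-off the thread discusses is visible here: nothing about the element's byte encoding changes, but anything holding the coder can now discover the schema.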
Any more comments on this approach before we start implementing a prototype?

Reuven

On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> If you need help on the json part, I'm happy to help. To give a few hints
> on what is very doable: we can add an avro module to johnzon (the ASF
> json{p,b} impl) to back jsonp with avro (I guess it will be one of the
> first to be asked), for instance.
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>
>
> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>
>> Agree. The initial implementation will be a prototype.
>>
>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Hi Reuven,
>>>
>>> Agreed on being able to describe the schema with different formats. The
>>> good point about json schemas is that they are described by a spec. My
>>> point is also to avoid reinventing the wheel. Just an abstraction able
>>> to use Avro, Json, Calcite, or custom schema descriptors would be great.
>>>
>>> Using a coder to describe a schema sounds like a smart move to implement
>>> quickly. However, it has to be clear in terms of documentation to avoid
>>> "side effects". I still think PCollection.setSchema() is better: it
>>> should be metadata (or a hint ;))) on the PCollection.
>>>
>>> Regards
>>> JB
>>>
>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>
>>>> As to the question of how a schema should be specified, I want to
>>>> support several common schema formats. So if a user has a Json schema,
>>>> or an Avro schema, or a Calcite schema, etc., there should be adapters
>>>> that allow setting a schema from any of them. I don't think we should
>>>> prefer one over the other. While Romain is right that many people know
>>>> Json, I think far fewer people know Json schemas.
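The adapter idea above can be sketched as one internal, format-neutral schema type with small converters from each external format. Everything below is illustrative: the names `Schema`, `fromJsonSchemaProperties`, and `fromAvroFields` are made up for this sketch and are not Beam, JSON Schema, or Avro API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical adapter sketch: one internal Schema representation, built
// from several (heavily simplified) external schema formats.
public class SchemaAdapters {

    /** The internal, format-neutral schema: field name -> type name. */
    public static final class Schema {
        final Map<String, String> fields = new LinkedHashMap<>();
    }

    /** Adapter from a simplified JSON-Schema-style "properties" map. */
    public static Schema fromJsonSchemaProperties(Map<String, String> properties) {
        Schema s = new Schema();
        // Normalize JSON Schema type names to the internal vocabulary.
        properties.forEach((name, jsonType) ->
            s.fields.put(name, jsonType.equals("integer") ? "INT64" : "STRING"));
        return s;
    }

    /** Adapter from a simplified Avro-style "name:type" field list. */
    public static Schema fromAvroFields(String... fields) {
        Schema s = new Schema();
        for (String f : fields) {
            String[] parts = f.split(":");
            // Normalize Avro type names to the same internal vocabulary.
            s.fields.put(parts[0], parts[1].equals("long") ? "INT64" : "STRING");
        }
        return s;
    }

    public static void main(String[] args) {
        Schema fromJson = fromJsonSchemaProperties(Map.of("id", "integer"));
        Schema fromAvro = fromAvroFields("id:long");
        // Both external formats normalize to the same internal schema,
        // which is what lets the rest of the model stay format-agnostic.
        if (!fromJson.fields.equals(fromAvro.fields)) {
            throw new AssertionError();
        }
        System.out.println(fromJson.fields);
    }
}
```

The key property is that everything downstream of the adapters sees only the internal representation, so no single external format is privileged.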
>>>>
>>>> Agree, schemas should not be enforced (for one thing, that wouldn't be
>>>> backwards compatible!). I think for the initial prototype I will
>>>> probably use a special coder to represent the schema (with setSchema an
>>>> option on the coder), largely because it doesn't require modifying
>>>> PCollection. However, I think longer term a schema should be an
>>>> optional piece of metadata on the PCollection object. Similar to the
>>>> previous discussion about "hints," I think this can be set on the
>>>> producing PTransform, and a SetSchema PTransform will allow attaching a
>>>> schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))). This
>>>> part isn't designed yet, but I think schemas should be similar to
>>>> hints: just another piece of metadata on the PCollection (though
>>>> interpreted by the model, whereas hints are interpreted by the runner).
>>>>
>>>> Reuven
>>>>
>>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I think we should avoid mixing two things in the discussion (and so
>>>> the document):
>>>>
>>>> 1. The element of the collection and the schema itself are two
>>>> different things. In essence, Beam should not enforce any schema.
>>>> That's why I think it's a good idea to set the schema optionally on the
>>>> PCollection (pcollection.setSchema()).
>>>>
>>>> 2. From point 1 come two questions: how do we represent a schema? How
>>>> can we leverage the schema to simplify the serialization of the
>>>> elements in the PCollection, and querying? These two questions are not
>>>> directly related.
>>>>
>>>> 2.1. How do we represent the schema?
>>>> Json Schema is a very interesting idea. It could be an abstraction, and
>>>> other providers, like Avro, could be bound to it. It's part of the json
>>>> processing spec (javax).
>>>>
>>>> 2.2.
How do we leverage the schema for query and serialization?
>>>> Also in the spec, JSON Pointer is interesting for querying. Regarding
>>>> the serialization, Jackson or another data binder can be used.
>>>>
>>>> These are still rough ideas in my mind, but I like Romain's idea about
>>>> json-p usage.
>>>>
>>>> Once the 2.3.0 release is out, I will start to update the document with
>>>> those ideas, and a PoC.
>>>>
>>>> Thanks!
>>>> Regards
>>>> JB
>>>>
>>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>> >
>>>> > On Jan 30, 2018 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>> >
>>>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>> >
>>>> > Hi
>>>> >
>>>> > I have some questions on this: how would hierarchical schemas
>>>> > work? It seems they are not really supported by the ecosystem
>>>> > (outside of custom stuff) :(. How would it integrate smoothly
>>>> > with other generic record types - N bridges?
>>>> >
>>>> > Do you mean nested schemas? What do you mean here?
>>>> >
>>>> > Yes, sorry - wrote the mail too late ;). I meant hierarchical data and
>>>> > nested schemas.
>>>> >
>>>> > Concretely, I wonder if using a json API couldn't be beneficial:
>>>> > json-p is a nice generic abstraction with a built-in querying
>>>> > mechanism (jsonpointer) but no actual serialization (even if json and
>>>> > binary json are very natural). The big advantage is to have a
>>>> > well-known ecosystem - who doesn't know json today? - that beam can
>>>> > reuse for free: JsonObject (I guess we don't want the JsonValue
>>>> > abstraction) for the record type, the jsonschema standard for the
>>>> > schema, jsonpointer for the selection/projection, etc. It doesn't
>>>> > enforce the actual
It doesnt enforce the actual >>>> serialization >>>> > (json, smile, avro, ...) but provide an expressive and >>>> alread known API >>>> > so i see it as a big win-win for users (no need to learn >>>> a new API and >>>> > use N bridges in all ways) and beam (impls are here and >>>> API design >>>> > already thought). >>>> > >>>> > >>>> > I assume you're talking about the API for setting schemas, >>>> not using them. >>>> > Json has many downsides and I'm not sure it's true that >>>> everyone knows it; >>>> > there are also competing schema APIs, such as Avro etc.. >>>> However I think we >>>> > should give Json a fair evaluation before dismissing it. >>>> > >>>> > >>>> > It is a wider topic than schema. Actually schema are not the >>>> first citizen but a >>>> > generic data representation is. That is where json hits almost >>>> any other API. >>>> > Then, when it comes to schema, json has a standard for that so we >>>> are all good. >>>> > >>>> > Also json has a good indexing API compared to alternatives which >>>> are sometimes a >>>> > bit faster - for noop transforms - but are hardly usable or make >>>> the code not >>>> > that readable. >>>> > >>>> > Avro is a nice competitor but it is compatible - actually avro is >>>> json driven by >>>> > design - but its API is far to be that easy due to its schema >>>> enforcement which >>>> > is heavvvyyy and worse is you cant work with avro without a >>>> schema. Json would >>>> > allow to reconciliate the dynamic and static cases since the job >>>> wouldnt change >>>> > except the setschema. >>>> > >>>> > That is why I think json is a good compromise and having a >>>> standard API for it >>>> > allow to fully customize the imol as will if needed - even using >>>> avro or protobuf. >>>> > >>>> > Side note on beam api: i dont think it is good to use a main API >>>> for runner >>>> > optimization. It enforces something to be shared on all runners >>>> but not widely >>>> > usable. It is also misleading for users. 
Would you set a flink >>>> pipeline option >>>> > with dataflow? My proposal here is to use hints - properties - >>>> instead of >>>> > something hardly defined in the API then standardize it if all >>>> runners support it. >>>> > >>>> > >>>> > >>>> > Wdyt? >>>> > >>>> > Le 29 janv. 2018 06:24, "Jean-Baptiste Onofré" >>>> <j...@nanthrax.net <mailto:j...@nanthrax.net> >>>> > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>> a >>>> écrit : >>>> >>>> > >>>> > Hi Reuven, >>>> > >>>> > Thanks for the update ! As I'm working with you on >>>> this, I fully >>>> > agree and great >>>> > doc gathering the ideas. >>>> > >>>> > It's clearly something we have to add asap in Beam, >>>> because it would >>>> > allow new >>>> > use cases for our users (in a simple way) and open >>>> new areas for the >>>> > runners >>>> > (for instance dataframe support in the Spark runner). >>>> > >>>> > By the way, while ago, I created BEAM-3437 to track >>>> the PoC/PR >>>> > around this. >>>> > >>>> > Thanks ! >>>> > >>>> > Regards >>>> > JB >>>> > >>>> > On 01/29/2018 02:08 AM, Reuven Lax wrote: >>>> > > Previously I submitted a proposal for adding >>>> schemas as a >>>> > first-class concept on >>>> > > Beam PCollections. The proposal engendered quite a >>>> bit of >>>> > discussion from the >>>> > > community - more discussion than I've seen from >>>> almost any of our >>>> > proposals to >>>> > > date! >>>> > > >>>> > > Based on the feedback and comments, I reworked the >>>> proposal >>>> > document quite a >>>> > > bit. It now talks more explicitly about the >>>> different between >>>> > dynamic schemas >>>> > > (where the schema is not fully not know at >>>> graph-creation time), >>>> > and static >>>> > > schemas (which are fully know at graph-creation >>>> time). 
Proposed >>>> > APIs are more >>>> > > fleshed out now (again thanks to feedback from >>>> community members), >>>> > and the >>>> > > document talks in more detail about evolving >>>> schemas in >>>> > long-running streaming >>>> > > pipelines. >>>> > > >>>> > > Please take a look. I think this will be very >>>> valuable to Beam, >>>> > and welcome any >>>> > > feedback. >>>> > > >>>> > > >>>> > >>>> https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ >>>> 12pHGK0QIvXS1FOTgRc/edit# >>>> <https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUm >>>> Q12pHGK0QIvXS1FOTgRc/edit#> >>>> > <https://docs.google.com/docu >>>> ment/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit# < >>>> https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUm >>>> Q12pHGK0QIvXS1FOTgRc/edit#>> >>>> > > >>>> > > Reuven >>>> > >>>> > -- >>>> > Jean-Baptiste Onofré >>>> > jbono...@apache.org <mailto:jbono...@apache.org> >>>> <mailto:jbono...@apache.org <mailto:jbono...@apache.org>> >>>> > http://blog.nanthrax.net >>>> > Talend - http://www.talend.com >>>> > >>>> > >>>> > >>>> >>>> -- >>>> Jean-Baptiste Onofré >>>> jbono...@apache.org <mailto:jbono...@apache.org> >>>> http://blog.nanthrax.net >>>> Talend - http://www.talend.com >>>> >>>> >>>> >> >