Re: Schema-Aware PCollections revisited

Romain Manni-Bucau Sun, 04 Feb 2018 00:23:52 -0800

@Reuven: is the prototype only about passing the schema, or also about the generic type?

There are 2.5 topics to solve for this issue:

1. How to pass the schema
1.a. Hints?
2. What the generic record type associated with a schema is, and how to
express a schema relative to it

I would be happy to help with 1.a and 2 if needed.
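
To make topic 2 a bit more concrete, here is a minimal sketch of a generic record (a plain Map) checked against a schema that supports nesting. Everything in it is an assumption for illustration: `SchemaSketch`, `Kind`, `Field`, `Schema` and `conforms` are hypothetical names, not Beam APIs and not what the prototype will look like.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are Beam APIs.
final class SchemaSketch {

    /** Field kinds this sketch understands; RECORD allows nested schemas. */
    enum Kind { STRING, LONG, RECORD }

    /** One named field; nested is only used when kind == RECORD. */
    static final class Field {
        final String name;
        final Kind kind;
        final Schema nested;

        Field(String name, Kind kind, Schema nested) {
            this.name = name;
            this.kind = kind;
            this.nested = nested;
        }
    }

    /** A schema is an ordered, read-only list of fields. */
    static final class Schema {
        final List<Field> fields;

        Schema(Field... fields) {
            this.fields = Collections.unmodifiableList(Arrays.asList(fields));
        }
    }

    /**
     * Returns true when the generic record (a plain Map) has every field the
     * schema declares, with the declared kind. Extra record entries are ignored.
     */
    static boolean conforms(Map<?, ?> record, Schema schema) {
        for (Field f : schema.fields) {
            Object value = record.get(f.name);
            boolean ok;
            if (f.kind == Kind.STRING) {
                ok = value instanceof String;
            } else if (f.kind == Kind.LONG) {
                ok = value instanceof Long;
            } else {
                // Nested record: recurse with the nested schema.
                ok = value instanceof Map && conforms((Map<?, ?>) value, f.nested);
            }
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Schema address = new Schema(new Field("city", Kind.STRING, null));
        Schema user = new Schema(
                new Field("name", Kind.STRING, null),
                new Field("age", Kind.LONG, null),
                new Field("address", Kind.RECORD, address));

        Map<String, Object> addr = new LinkedHashMap<>();
        addr.put("city", "Paris");
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("name", "Ada");
        record.put("age", 36L);
        record.put("address", addr);

        System.out.println(conforms(record, user));
        record.put("age", "36"); // wrong kind for the "age" field
        System.out.println(conforms(record, user));
    }
}
```

The point of the sketch is only that the record type stays generic while the schema is separate, optional metadata: the same Map works with or without a schema attached.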

On Feb 4, 2018 03:30, "Reuven Lax" <re...@google.com> wrote:

> One more thing. If anyone here has experience with various OSS metadata
> stores (Kafka Schema Registry is one example), would you like to
> collaborate on implementation? I want to make sure that source schemas can
> be stored in a variety of OSS metadata stores, and be easily pulled into a
> Beam pipeline.
>
> Reuven
>
> On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
>
>> Hi all,
>>
>> If there are no concerns, I would like to start working on a prototype.
>> It's just a prototype, so I don't think it will have the final API (e.g.
>> for the prototype I'm going to avoid changing the API of PCollection, and use
>> a "special" Coder instead). Also, even once we go beyond the prototype, it will
>> be @Experimental for some time, so the API will not be set in stone.
>>
>> Any more comments on this approach before we start implementing a
>> prototype?
>>
>> Reuven
>>
>> On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>
>>> If you need help with the JSON part, I'm happy to help. As an example of
>>> what is very doable: we can add an Avro module to Johnzon (the ASF JSON-P
>>> and JSON-B implementation) to back JSON-P with Avro (I guess it will be one
>>> of the first things to be asked for).
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau>
>>>
>>> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>>
>>>> Agree. The initial implementation will be a prototype.
>>>>
>>>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>
>>>>> Hi Reuven,
>>>>>
>>>>> I agree that we should be able to describe the schema in different
>>>>> formats. The good point about json schemas is that they are described by
>>>>> a spec. My point is also to avoid reinventing the wheel. Just an
>>>>> abstraction that lets us use Avro, Json, Calcite, or custom schema
>>>>> descriptors would be great.
>>>>>
>>>>> Using a coder to describe a schema sounds like a smart way to get an
>>>>> implementation quickly. However, it has to be clearly documented to avoid
>>>>> "side effects". I still think PCollection.setSchema() is better: it should
>>>>> be metadata (or a hint ;))) on the PCollection.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>>>
>>>>>> As to the question of how a schema should be specified, I want to
>>>>>> support several common schema formats. So if a user has a Json schema, or
>>>>>> an Avro schema, or a Calcite schema, etc. there should be adapters that
>>>>>> allow setting a schema from any of them. I don't think we should prefer
>>>>>> one over the other. While Romain is right that many people know Json, I
>>>>>> think far fewer people know Json schemas.
>>>>>>
>>>>>> Agree, schemas should not be enforced (for one thing, that wouldn't
>>>>>> be backwards compatible!). I think for the initial prototype I will
>>>>>> probably use a special coder to represent the schema (with setSchema an
>>>>>> option on the coder), largely because it doesn't require modifying
>>>>>> PCollection. However I think longer term a schema should be an optional
>>>>>> piece of metadata on the PCollection object. Similar to the previous
>>>>>> discussion about "hints," I think this can be set on the producing
>>>>>> PTransform, and a SetSchema PTransform will allow attaching a schema to
>>>>>> any PCollection (i.e. pc.apply(SetSchema.of(schema))). This part isn't
>>>>>> designed yet, but I think schema should be similar to hints, it's just
>>>>>> another piece of metadata on the PCollection (though something
>>>>>> interpreted by the model, where hints are interpreted by the runner).
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>
>>>>>>     Hi,
>>>>>>
>>>>>>     I think we should avoid mixing two things in the discussion (and
>>>>>>     so in the document):
>>>>>>
>>>>>>     1. The element of the collection and the schema itself are two
>>>>>>     different things. By nature, Beam should not enforce any schema.
>>>>>>     That's why I think it's a good idea to set the schema optionally
>>>>>>     on the PCollection (pcollection.setSchema()).
>>>>>>
>>>>>>     2. From point 1 come two questions: how do we represent a schema?
>>>>>>     How can we leverage the schema to simplify the serialization of
>>>>>>     the elements in the PCollection, and querying? These two questions
>>>>>>     are not directly related.
>>>>>>
>>>>>>       2.1. How do we represent the schema
>>>>>>     Json Schema is a very interesting idea. It could be an abstraction,
>>>>>>     and other providers, like Avro, can be bound to it. It's part of
>>>>>>     the json processing spec (javax).
>>>>>>
>>>>>>       2.2. How do we leverage the schema for query and serialization
>>>>>>     Also in the spec, json pointer is interesting for the querying.
>>>>>>     Regarding the serialization, Jackson or another data binder can be
>>>>>>     used.
>>>>>>
>>>>>>     These are still rough ideas in my mind, but I like Romain's idea
>>>>>>     about json-p usage.
>>>>>>
>>>>>>     Once the 2.3.0 release is out, I will start updating the document
>>>>>>     with those ideas, and a PoC.
>>>>>>
>>>>>>     Thanks !
>>>>>>     Regards
>>>>>>     JB
>>>>>>
>>>>>>     On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>>>>     >
>>>>>>     >
>>>>>>     > On Jan 30, 2018 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>>>>     >
>>>>>>     >
>>>>>>     >
>>>>>>     >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>      >
>>>>>>      >         Hi
>>>>>>      >
>>>>>>      >         I have some questions on this: how would hierarchical
>>>>>>      >         schemas work? It seems they are not really supported by
>>>>>>      >         the ecosystem (outside of custom stuff) :(. How would it
>>>>>>      >         integrate smoothly with other generic record types - N
>>>>>>      >         bridges?
>>>>>>      >
>>>>>>      >
>>>>>>      >     Do you mean nested schemas? What do you mean here?
>>>>>>      >
>>>>>>      >
>>>>>>      > Yes, sorry - I wrote the mail too late ;). I meant hierarchical
>>>>>>      > data and nested schemas.
>>>>>>      >
>>>>>>      >
>>>>>>      >         Concretely, I wonder if using the json API couldn't be
>>>>>>      >         beneficial: json-p is a nice generic abstraction with a
>>>>>>      >         built-in querying mechanism (jsonpointer) but no actual
>>>>>>      >         serialization (even if json and binary json are very
>>>>>>      >         natural). The big advantage is to have a well-known
>>>>>>      >         ecosystem - who doesn't know json today? - that Beam can
>>>>>>      >         reuse for free: JsonObject (I guess we don't want the
>>>>>>      >         JsonValue abstraction) for the record type, the
>>>>>>      >         jsonschema standard for the schema, jsonpointer for the
>>>>>>      >         selection/projection, etc. It doesn't enforce the actual
>>>>>>      >         serialization (json, smile, avro, ...) but provides an
>>>>>>      >         expressive and already known API, so I see it as a big
>>>>>>      >         win-win for users (no need to learn a new API and use N
>>>>>>      >         bridges in all directions) and Beam (the implementations
>>>>>>      >         exist and the API design is already thought out).
>>>>>>      >
>>>>>>      >
>>>>>>      >     I assume you're talking about the API for setting schemas,
>>>>>>      >     not using them. Json has many downsides and I'm not sure
>>>>>>      >     it's true that everyone knows it; there are also competing
>>>>>>      >     schema APIs, such as Avro. However I think we should give
>>>>>>      >     Json a fair evaluation before dismissing it.
>>>>>>      >
>>>>>>      >
>>>>>>      > It is a wider topic than schemas. Actually, schemas are not the
>>>>>>      > first-class citizen here; a generic data representation is. That
>>>>>>      > is where json beats almost any other API. Then, when it comes to
>>>>>>      > schemas, json has a standard for that, so we are all good.
>>>>>>      >
>>>>>>      > Also, json has a good indexing API compared to alternatives, which
>>>>>>      > are sometimes a bit faster - for no-op transforms - but are hardly
>>>>>>      > usable or make the code less readable.
>>>>>>      >
>>>>>>      > Avro is a nice competitor, and it is compatible - Avro is actually
>>>>>>      > json-driven by design - but its API is far from being that easy,
>>>>>>      > because its schema enforcement is heavy and, worse, you can't work
>>>>>>      > with Avro without a schema. Json would allow reconciling the
>>>>>>      > dynamic and static cases, since the job wouldn't change except
>>>>>>      > for the setSchema call.
>>>>>>      >
>>>>>>      > That is why I think json is a good compromise, and having a
>>>>>>      > standard API for it allows fully customizing the implementation
>>>>>>      > at will if needed - even using Avro or protobuf.
>>>>>>      >
>>>>>>      > Side note on the Beam API: I don't think it is good to use a main
>>>>>>      > API for runner optimization. It forces something to be shared by
>>>>>>      > all runners even though it is not widely usable. It is also
>>>>>>      > misleading for users. Would you set a Flink pipeline option with
>>>>>>      > Dataflow? My proposal here is to use hints - properties - instead
>>>>>>      > of something hard-coded in the API, then standardize it if all
>>>>>>      > runners support it.
>>>>>>      >
>>>>>>      >
>>>>>>      >
>>>>>>      >         Wdyt?
>>>>>>      >
>>>>>>      >         On Jan 29, 2018 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>>>>>>      >
>>>>>>      >             Hi Reuven,
>>>>>>      >
>>>>>>      >             Thanks for the update! As I'm working with you on
>>>>>>      >             this, I fully agree, and it's a great doc gathering
>>>>>>      >             the ideas.
>>>>>>      >
>>>>>>      >             It's clearly something we have to add to Beam ASAP,
>>>>>>      >             because it would allow new use cases for our users
>>>>>>      >             (in a simple way) and open new areas for the runners
>>>>>>      >             (for instance dataframe support in the Spark
>>>>>>      >             runner).
>>>>>>      >
>>>>>>      >             By the way, a while ago, I created BEAM-3437 to
>>>>>>      >             track the PoC/PR around this.
>>>>>>      >
>>>>>>      >             Thanks !
>>>>>>      >
>>>>>>      >             Regards
>>>>>>      >             JB
>>>>>>      >
>>>>>>      >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>>>      >             > Previously I submitted a proposal for adding schemas as a
>>>>>>      >             > first-class concept on Beam PCollections. The proposal
>>>>>>      >             > engendered quite a bit of discussion from the community -
>>>>>>      >             > more discussion than I've seen from almost any of our
>>>>>>      >             > proposals to date!
>>>>>>      >             >
>>>>>>      >             > Based on the feedback and comments, I reworked the proposal
>>>>>>      >             > document quite a bit. It now talks more explicitly about the
>>>>>>      >             > difference between dynamic schemas (where the schema is not
>>>>>>      >             > fully known at graph-creation time), and static schemas
>>>>>>      >             > (which are fully known at graph-creation time). Proposed
>>>>>>      >             > APIs are more fleshed out now (again thanks to feedback from
>>>>>>      >             > community members), and the document talks in more detail
>>>>>>      >             > about evolving schemas in long-running streaming pipelines.
>>>>>>      >             >
>>>>>>      >             > Please take a look. I think this will be very valuable to
>>>>>>      >             > Beam, and welcome any feedback.
>>>>>>      >             >
>>>>>>      >             >
>>>>>>      >             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>>>      >             >
>>>>>>      >             > Reuven
>>>>>>      >
>>>>>>      >             --
>>>>>>      >             Jean-Baptiste Onofré
>>>>>>      >             jbono...@apache.org
>>>>>>      >             http://blog.nanthrax.net
>>>>>>      >             Talend - http://www.talend.com
>>>>>>      >
>>>>>>      >
>>>>>>      >
>>>>>>
>>>>>>     --
>>>>>>     Jean-Baptiste Onofré
>>>>>>     jbono...@apache.org
>>>>>>     http://blog.nanthrax.net
>>>>>>     Talend - http://www.talend.com
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>
>
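
For readers who have not used jsonpointer, the selection/projection idea from the quoted thread can be sketched without any dependency. This is only an illustration under stated assumptions: `PointerSketch`, `resolve` and `project` are hypothetical names, and nested Maps and Lists stand in for JsonObject and JsonArray. JSON-P 1.1 provides a real `JsonPointer` type for this; the sketch deliberately does not use it so that it runs on a bare JDK.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a dependency-free stand-in, not the JSON-P API.
final class PointerSketch {

    /** Resolves an RFC 6901 pointer such as "/address/city" or "/tags/0". */
    static Object resolve(Object root, String pointer) {
        if (pointer.isEmpty()) {
            return root; // the empty pointer selects the whole document
        }
        if (pointer.charAt(0) != '/') {
            throw new IllegalArgumentException("pointer must start with '/'");
        }
        Object current = root;
        // limit -1 keeps trailing empty tokens; tokens[0] is the empty prefix
        String[] tokens = pointer.split("/", -1);
        for (int i = 1; i < tokens.length; i++) {
            // RFC 6901 unescaping order: "~1" -> "/" first, then "~0" -> "~"
            String key = tokens[i].replace("~1", "/").replace("~0", "~");
            if (current instanceof Map) {
                Map<?, ?> map = (Map<?, ?>) current;
                if (!map.containsKey(key)) {
                    throw new IllegalArgumentException("no member named " + key);
                }
                current = map.get(key);
            } else if (current instanceof List) {
                // Simplification: no check for leading zeros or the "-" token
                current = ((List<?>) current).get(Integer.parseInt(key));
            } else {
                throw new IllegalArgumentException("cannot descend into a scalar at " + key);
            }
        }
        return current;
    }

    /** Projection: keeps only the values selected by the given pointers. */
    static Map<String, Object> project(Object root, String... pointers) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String p : pointers) {
            out.put(p, resolve(root, p));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Paris");
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("name", "Ada");
        record.put("address", address);
        record.put("tags", Arrays.asList("a", "b"));

        System.out.println(resolve(record, "/address/city"));
        System.out.println(project(record, "/name", "/tags/1"));
    }
}
```

Note that nothing here depends on how the record is serialized, which is the property argued for in the thread: the pointer only needs a navigable generic record.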
