Re: Schema-Aware PCollections revisited

Jean-Baptiste Onofré Mon, 05 Mar 2018 04:31:20 -0800

Cool,

Can I work with you on this (sharing a branch, for instance)?


Thanks !
Regards
JB
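
For reference, the coder-based idea discussed in the quoted thread (pass the
schema into a special coder instead of changing PCollection, then select
fields by name) might look roughly like the plain-Java sketch below. All names
here (SchemaSketch, Schema, SchemaCoder, select) are hypothetical and are not
Beam's actual API, and the byte layout is an assumption made only for the
example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these types exist in Beam under these names.
final class SchemaSketch {

  // Integer and floating-point types are kept distinct on purpose.
  enum FieldType { INT64, DOUBLE, STRING }

  /** An ordered list of named, typed fields. */
  static final class Schema {
    final List<String> names = new ArrayList<>();
    final List<FieldType> types = new ArrayList<>();

    Schema field(String name, FieldType type) {
      names.add(name);
      types.add(type);
      return this;
    }

    int indexOf(String name) {
      int i = names.indexOf(name);
      if (i < 0) {
        throw new IllegalArgumentException("Unknown field: " + name);
      }
      return i;
    }
  }

  /** A coder that is handed the schema, so rows do not carry it themselves. */
  static final class SchemaCoder {
    final Schema schema;

    SchemaCoder(Schema schema) {
      this.schema = schema;
    }

    /** Writes the row's values in schema order, using the declared type of each field. */
    byte[] encode(Object[] row) {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(bytes)) {
        for (int i = 0; i < schema.types.size(); i++) {
          switch (schema.types.get(i)) {
            case INT64: out.writeLong((Long) row[i]); break;
            case DOUBLE: out.writeDouble((Double) row[i]); break;
            case STRING: out.writeUTF((String) row[i]); break;
            default: throw new IllegalStateException("Unhandled type");
          }
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      return bytes.toByteArray();
    }

    /** Reads values back in schema order; the schema alone drives the decoding. */
    Object[] decode(byte[] data) {
      Object[] row = new Object[schema.types.size()];
      try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
        for (int i = 0; i < row.length; i++) {
          switch (schema.types.get(i)) {
            case INT64: row[i] = in.readLong(); break;
            case DOUBLE: row[i] = in.readDouble(); break;
            case STRING: row[i] = in.readUTF(); break;
            default: throw new IllegalStateException("Unhandled type");
          }
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      return row;
    }
  }

  /** Projection by field name, in the spirit of the proposed Select.fields. */
  static Object[] select(Schema schema, Object[] row, String... fields) {
    Object[] out = new Object[fields.length];
    for (int i = 0; i < fields.length; i++) {
      out[i] = row[schema.indexOf(fields[i])];
    }
    return out;
  }
}
```

With a schema of (id: INT64, score: DOUBLE, name: STRING), a row can be
round-tripped through the coder and then projected with
select(schema, row, "name", "id"). Keeping the schema in the coder rather than
in each row is the design point under discussion.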

On 03/05/2018 01:01 PM, Reuven Lax wrote:
> Yes, I do have a PoC in progress. The Beam Row class was being refactored, so I
> paused to wait for that to finish.
> 
> 
> On Sun, Mar 4, 2018 at 8:24 PM Jean-Baptiste Onofré <[email protected]> wrote:
> 
>     Hi Reuven,
> 
>     I'm reviving this discussion, as I think it would be a great addition.
> 
>     We have discussed it on the fly, but I think it would now be great to have a
>     feature branch as a base for discussion, where we can start a sketch/impl.
> 
>     @Reuven, did you start a PoC with what you proposed:
>     - SchemaCoder
>     - SchemaRegistry
>     - @FieldAccess on DoFn
>     - Select.fields PTransform
>     ?
> 
>     If not, I volunteer to start the branch and begin sketching.
> 
>     Thoughts?
> 
>     Regards
>     JB
> 
>     On 02/04/2018 08:23 PM, Reuven Lax wrote:
>     > Cool, let's chat about this on Slack for a bit (which I realized I've been
>     > signed out of for some time).
>     >
>     > Reuven
>     >
>     > On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré <[email protected]> wrote:
>     >
>     >     Sorry guys, I was off today. Happy to be part of the party too ;)
>     >
>     >     Regards
>     >     JB
>     >
>     >     On 02/04/2018 06:19 PM, Reuven Lax wrote:
>     >     > Romain, since you're interested, maybe the two of us should put together a
>     >     > proposal for how to set these things (hints, schema) on PCollections? I don't
>     >     > think it'll be hard - the previous list thread on hints already agreed on a
>     >     > general approach, and we would just need to flesh it out.
>     >     >
>     >     > BTW, in the past when I looked, JSON schemas seemed to have some odd limitations
>     >     > inherited from JavaScript (e.g. no distinction between integer and
>     >     > floating-point types). Is that still true?
>     >     >
>     >     > Reuven
>     >     >
>     >     > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <[email protected]> wrote:
>     >     >
>     >     >
>     >     >
>     >     >     2018-02-04 17:53 GMT+01:00 Reuven Lax <[email protected]>:
>     >     >
>     >     >
>     >     >
>     >     >         On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <[email protected]> wrote:
>     >     >
>     >     >
>     >     >             2018-02-04 17:37 GMT+01:00 Reuven Lax <[email protected]>:
>     >     >
>     >     >                 I'm not sure where proto comes from here. Proto is one example
>     >     >                 of a type that has a schema, but only one example.
>     >     >
>     >     >                 1. In the initial prototype I want to avoid modifying the
>     >     >                 PCollection API. So I think it's best to create a special
>     >     >                 SchemaCoder, and pass the schema into this coder. Later we might
>     >     >                 add targeted APIs for this instead of going through a coder.
>     >     >                 1.a I don't see what hints have to do with this?
>     >     >
>     >     >
>     >     >             Hints are a way to replace the new API and unify the way to pass
>     >     >             metadata in Beam instead of adding a new custom way each time.
>     >     >
>     >     >
>     >     >         I don't think schema is a hint. But I hear what you're saying - a hint
>     >     >         is a type of PCollection metadata, as is schema, and we should have a
>     >     >         unified API for setting such metadata.
>     >     >
>     >     >
>     >     >     :), Ismael pointed out to me earlier this week that "hint" had an old meaning
>     >     >     in Beam. My usage is purely the one found in most EE specs (your "metadata" in
>     >     >     the previous answer). But I guess we are aligned on the meaning now; I just
>     >     >     wanted to be sure.
>     >     >      
>     >     >
>     >     >          
>     >     >
>     >     >              
>     >     >
>     >     >
>     >     >                 2. BeamSQL already has a generic record type which fits this use
>     >     >                 case very well (though we might modify it). However, as mentioned
>     >     >                 in the doc, the user is never forced to use this generic record
>     >     >                 type.
>     >     >
>     >     >
>     >     >             Well, yes and no. A type already exists, but 1. it is very strictly
>     >     >             limited (flat/columns only, which is a small part of what big data SQL
>     >     >             can do) and 2. it must be aligned with the convergence on generic data
>     >     >             that the schema will bring (really read "aligned" as "dropped in favor
>     >     >             of" - deprecation being a smooth way to do it).
>     >     >
>     >     >
>     >     >         As I said, the existing class needs to be modified and extended, and not
>     >     >         just for this schema use case. It was meant to represent Calcite SQL rows,
>     >     >         but doesn't quite even do that yet (Calcite supports nested rows).
>     >     >         However, I think it's the right basis to start from.
>     >     >
>     >     >
>     >     >     Agree on the state. The issues I hit with the current impl (in addition to
>     >     >     nested support, which would by itself require a kind of visitor solution) are
>     >     >     that the record owns the schema and that serialization is handled field by
>     >     >     field instead of as a whole, which is how it would be handled with a
>     >     >     schema IMHO.
>     >     >
>     >     >     Concretely, what I don't want is to do a PoC which works - they all work,
>     >     >     right? - and integrate it into Beam without thinking about a global solution
>     >     >     for this generic record issue and its schema standardization. This is where
>     >     >     Json(-P) has a lot of value IMHO, but it requires a bit more love than just
>     >     >     adding schema to the model.
>     >     >      
>     >     >
>     >     >          
>     >     >
>     >     >
>     >     >             So, long story short, the main work of this schema track is not only
>     >     >             about using schemas in runners and elsewhere, but also about starting
>     >     >             to make Beam consistent with itself, which is probably the most
>     >     >             important outcome since it is the user-facing side of this work.
>     >     >              
>     >     >
>     >     >
>     >     >                 On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <[email protected]> wrote:
>     >     >
>     >     >                     @Reuven: is the proto only about passing the schema, or also
>     >     >                     the generic type?
>     >     >
>     >     >                     There are 2.5 topics to solve for this issue:
>     >     >
>     >     >                     1. How to pass the schema
>     >     >                     1.a. hints?
>     >     >                     2. What is the generic record type associated with a schema,
>     >     >                     and how to express a schema relative to it
>     >     >
>     >     >                     I would be happy to help on 1.a and 2 somehow if you need.
>     >     >
>     >     >                     On Feb 4, 2018 at 03:30, "Reuven Lax" <[email protected]> wrote:
>     >     >
>     >     >                         One more thing. If anyone here has experience with
>     >     >                         various OSS metadata stores (e.g. Kafka Schema Registry
>     >     >                         is one example), would you like to collaborate on
>     >     >                         implementation? I want to make sure that source schemas
>     >     >                         can be stored in a variety of OSS metadata stores, and
>     >     >                         be easily pulled into a Beam pipeline.
>     >     >
>     >     >                         Reuven
>     >     >
>     >     >                         On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <[email protected]> wrote:
>     >     >
>     >     >                             Hi all,
>     >     >
>     >     >                             If there are no concerns, I would like to start
>     >     >                             working on a prototype. It's just a prototype, so I
>     >     >                             don't think it will have the final API (e.g. for the
>     >     >                             prototype I'm going to avoid changing the API of
>     >     >                             PCollection, and use a "special" Coder instead).
>     >     >                             Also, even once we go beyond the prototype, it will be
>     >     >                             @Experimental for some time, so the API will not be
>     >     >                             set in stone.
>     >     >
>     >     >                             Any more comments on this approach before we start
>     >     >                             implementing a prototype?
>     >     >
>     >     >                             Reuven
>     >     >
>     >     >                             On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <[email protected]> wrote:
>     >     >
>     >     >                                 If you need help on the JSON part, I'm happy to
>     >     >                                 help. To give a few hints on what is very
>     >     >                                 doable: we can, for instance, add an Avro module
>     >     >                                 to Johnzon (ASF json{p,b} impl) to back JSON-P
>     >     >                                 with Avro (I guess it will be one of the first
>     >     >                                 things asked for).
>     >     >
>     >     >
>     >     >                                 Romain Manni-Bucau
>     >     >                                 @rmannibucau <https://twitter.com/rmannibucau> |
>     >     >                                 Blog <https://rmannibucau.metawerx.net/> |
>     >     >                                 Old Blog <http://rmannibucau.wordpress.com> |
>     >     >                                 Github <https://github.com/rmannibucau> |
>     >     >                                 LinkedIn <https://www.linkedin.com/in/rmannibucau>
>     >     >
>     >     >                                 2018-01-31 22:06 GMT+01:00 Reuven Lax <[email protected]>:
>     >     >
>     >     >                                     Agree. The initial implementation will be a
>     >     >                                     prototype.
>     >     >
>     >     >                                     On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <[email protected]> wrote:
>     >     >
>     >     >                                         Hi Reuven,
>     >     >
>     >     >                                         I agree we should be able to describe the
>     >     >                                         schema in different formats. The good point
>     >     >                                         about JSON schemas is that they are
>     >     >                                         described by a spec. My point is also to
>     >     >                                         avoid reinventing the wheel. Just an
>     >     >                                         abstraction able to use Avro, JSON,
>     >     >                                         Calcite, or custom schema descriptors would
>     >     >                                         be great.
>     >     >
>     >     >                                         Using a coder to describe a schema sounds
>     >     >                                         like a smart move to implement quickly.
>     >     >                                         However, it has to be clear in terms of
>     >     >                                         documentation to avoid "side effects". I
>     >     >                                         still think PCollection.setSchema() is
>     >     >                                         better: it should be metadata (or hint
>     >     >                                         ;))) on the PCollection.
>     >     >
>     >     >                                         Regards
>     >     >                                         JB
>     >     >
>     >     >                                         On 31/01/2018 20:16, Reuven Lax wrote:
>     >     >
>     >     >                                             As to the question of how a schema
>     >     >                                             should be specified, I want to
>     >     >                                             support several common schema
>     >     >                                             formats. So if a user has a JSON
>     >     >                                             schema, or an Avro schema, or a
>     >     >                                             Calcite schema, etc., there should be
>     >     >                                             adapters that allow setting a schema
>     >     >                                             from any of them. I don't think we
>     >     >                                             should prefer one over the other.
>     >     >                                             While Romain is right that many
>     >     >                                             people know JSON, I think far fewer
>     >     >                                             people know JSON schemas.
>     >     >
>     >     >                                             Agree, schemas should not be
>     >     >                                             enforced (for one thing, that
>     >     >                                             wouldn't be backwards compatible!).
>     >     >                                             I think for the initial prototype I
>     >     >                                             will probably use a special coder to
>     >     >                                             represent the schema (with setSchema
>     >     >                                             an option on the coder), largely
>     >     >                                             because it doesn't require modifying
>     >     >                                             PCollection. However, I think longer
>     >     >                                             term a schema should be an optional
>     >     >                                             piece of metadata on the PCollection
>     >     >                                             object. Similar to the previous
>     >     >                                             discussion about "hints," I think
>     >     >                                             this can be set on the producing
>     >     >                                             PTransform, and a SetSchema
>     >     >                                             PTransform will allow attaching a
>     >     >                                             schema to any PCollection (i.e.
>     >     >                                             pc.apply(SetSchema.of(schema))).
>     >     >                                             This part isn't designed yet, but I
>     >     >                                             think schema should be similar to
>     >     >                                             hints: it's just another piece of
>     >     >                                             metadata on the PCollection (though
>     >     >                                             something interpreted by the model,
>     >     >                                             where hints are interpreted by the
>     >     >                                             runner).
>     >     >
>     >     >                                             Reuven
>     >     >
>     >     >                                             On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <[email protected]> wrote:
>     >     >
>     >     >                                                 Hi,
>     >     >
>     >     >                                                 I think we should avoid mixing two
>     >     >                                                 things in the discussion (and so in
>     >     >                                                 the document):
>     >     >
>     >     >                                                 1. The element of the collection and
>     >     >                                                 the schema itself are two different
>     >     >                                                 things. In essence, Beam should not
>     >     >                                                 enforce any schema. That's why I
>     >     >                                                 think it's a good idea to set the
>     >     >                                                 schema optionally on the PCollection
>     >     >                                                 (pcollection.setSchema()).
>     >     >
>     >     >                                                 2. From point 1 come two questions:
>     >     >                                                 how do we represent a schema? How can
>     >     >                                                 we leverage the schema to simplify
>     >     >                                                 the serialization of the elements in
>     >     >                                                 the PCollection, and querying? These
>     >     >                                                 two questions are not directly
>     >     >                                                 related.
>     >     >
>     >     >                                                   2.1. How do we represent the schema
>     >     >                                                 Json Schema is a very interesting
>     >     >                                                 idea. It could be an abstraction, and
>     >     >                                                 other providers, like Avro, can be
>     >     >                                                 bound to it. It's part of the json
>     >     >                                                 processing spec (javax).
>     >     >
>     >     >                                                   2.2. How do we leverage the schema
>     >     >                                                   for query and serialization
>     >     >                                                 Also in the spec, json pointer is
>     >     >                                                 interesting for querying. Regarding
>     >     >                                                 serialization, jackson or another
>     >     >                                                 data binder can be used.
>     >     >
>     >     >                                                 These are still rough ideas in my
>     >     >                                                 mind, but I like Romain's idea about
>     >     >                                                 json-p usage.
>     >     >
>     >     >                                                 Once the 2.3.0 release is out, I will
>     >     >                                                 start to update the document with
>     >     >                                                 those ideas, and the PoC.
>     >     >
>     >     >                                                 Thanks !
>     >     >                                                 Regards
>     >     >                                                 JB
>     >     >
>     >     >                                                 On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>     >     >                                                 >
>     >     >                                                 >
>     >     >                                                 > On Jan 30, 2018 at 01:09, "Reuven Lax" <[email protected]> wrote:
>     >     >                                                 >
>     >     >                                                 >
>     >     >                                                 >
>     >     >                                                 >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <[email protected]> wrote:
>     >     >                                                  >
>     >     >                                                  >         Hi
>     >     >                                                  >
>     >     >                                                  >         I have some questions on this: how would
>     >     >                                                  >         hierarchical schemas work? It seems they
>     >     >                                                  >         are not really supported by the ecosystem
>     >     >                                                  >         (outside of custom stuff) :(. How would it
>     >     >                                                  >         integrate smoothly with other generic
>     >     >                                                  >         record types - N bridges?
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  >     Do you mean nested schemas? What do you mean here?
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  > Yes, sorry - I wrote the mail too late ;). I meant
>     >     >                                                  > hierarchical data and nested schemas.
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  >         Concretely, I wonder if using the JSON API couldn't be
>     >     >                                                  >         beneficial: JSON-P is a nice generic abstraction with a
>     >     >                                                  >         built-in querying mechanism (jsonpointer) but no actual
>     >     >                                                  >         serialization (even if JSON and binary JSON are very
>     >     >                                                  >         natural). The big advantage is to have a well-known
>     >     >                                                  >         ecosystem - who doesn't know JSON today? - that Beam can
>     >     >                                                  >         reuse for free: JsonObject (I guess we don't want the
>     >     >                                                  >         JsonValue abstraction) for the record type, the jsonschema
>     >     >                                                  >         standard for the schema, jsonpointer for the
>     >     >                                                  >         selection/projection, etc. It doesn't enforce the actual
>     >     >                                                  >         serialization (JSON, Smile, Avro, ...) but provides an
>     >     >                                                  >         expressive and already known API, so I see it as a big
>     >     >                                                  >         win-win for users (no need to learn a new API and use N
>     >     >                                                  >         bridges in all ways) and Beam (impls are here and the API
>     >     >                                                  >         design is already thought out).
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  >     I assume you're talking about the API for setting schemas,
>     >     >                                                  >     not using them. Json has many downsides and I'm not sure
>     >     >                                                  >     it's true that everyone knows it; there are also competing
>     >     >                                                  >     schema APIs, such as Avro etc. However I think we should
>     >     >                                                  >     give Json a fair evaluation before dismissing it.
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  > It is a wider topic than schemas. Actually, schemas are not the
>     >     >                                                  > first-class citizen; a generic data representation is. That is
>     >     >                                                  > where JSON beats almost any other API. Then, when it comes to
>     >     >                                                  > schemas, JSON has a standard for that, so we are all good.
>     >     >                                                  >
>     >     >                                                  > Also, JSON has a good indexing API compared to alternatives,
>     >     >                                                  > which are sometimes a bit faster - for noop transforms - but
>     >     >                                                  > are hardly usable or make the code not that readable.
>     >     >                                                  >
>     >     >                                                  > Avro is a nice competitor and it is compatible - actually Avro
>     >     >                                                  > is JSON-driven by design - but its API is far from being that
>     >     >                                                  > easy due to its schema enforcement, which is heavvvyyy, and
>     >     >                                                  > worse, you can't work with Avro without a schema. JSON would
>     >     >                                                  > allow us to reconcile the dynamic and static cases, since the
>     >     >                                                  > job wouldn't change except for the setschema.
>     >     >                                                  >
>     >     >                                                  > That is why I think JSON is a good compromise, and having a
>     >     >                                                  > standard API for it allows fully customizing the impl at will
>     >     >                                                  > if needed - even using Avro or protobuf.
>     >     >                                                  >
>     >     >                                                  > Side note on the Beam API: I don't think it is good to use a
>     >     >                                                  > main API for runner optimization. It forces something to be
>     >     >                                                  > shared across all runners even though it is not widely usable.
>     >     >                                                  > It is also misleading for users. Would you set a Flink pipeline
>     >     >                                                  > option with Dataflow? My proposal here is to use hints -
>     >     >                                                  > properties - instead of something rigidly defined in the API,
>     >     >                                                  > then standardize it if all runners support it.
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  >
>     >     >                                                  >         Wdyt?
>     >     >                                                  >
>     >     >                                                  >         On 29 Jan 2018 at 06:24, "Jean-Baptiste Onofré"
>     >     >                                                  >         <[email protected]> wrote:
>     >     >
>     >     >                                                  >
>     >     >                                                  >             Hi Reuven,
>     >     >                                                  >
>     >     >                                                  >             Thanks for the update! As I'm working with you on
>     >     >                                                  >             this, I fully agree, and it's a great doc gathering
>     >     >                                                  >             the ideas.
>     >     >                                                  >
>     >     >                                                  >             It's clearly something we have to add asap in Beam,
>     >     >                                                  >             because it would allow new use cases for our users
>     >     >                                                  >             (in a simple way) and open new areas for the runners
>     >     >                                                  >             (for instance dataframe support in the Spark runner).
>     >     >                                                  >
>     >     >                                                  >             By the way, a while ago, I created BEAM-3437 to
> 

-- 
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
