Re: [DISCUSSION] Add hint/option on PCollection

Jean-Baptiste Onofré Tue, 30 Jan 2018 22:26:42 -0800

Hi,

yeah, it sounds good to me. I will create the Jira to track this and start a PoC
on the Composite.
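
As a starting point for that PoC, here is a minimal sketch of the composite idea in plain Java, with no Beam dependency. Everything in it is hypothetical and for illustration only: HintedStep, WithHintsSketch and the "memory.mb" key are assumptions, not Beam classes or real hint names. It only shows a composite pushing its hints down onto its inner steps, while execution never reads them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a transform: a name, a function, and its hints.
// This is NOT a Beam class; it only models the idea discussed in the thread.
final class HintedStep<T> {
  final String name;
  final UnaryOperator<T> fn;
  final Map<String, Object> hints;

  HintedStep(String name, UnaryOperator<T> fn, Map<String, Object> hints) {
    this.name = name;
    this.fn = fn;
    this.hints = Collections.unmodifiableMap(new LinkedHashMap<String, Object>(hints));
  }

  static <T> HintedStep<T> of(String name, UnaryOperator<T> fn) {
    return new HintedStep<T>(name, fn, Collections.<String, Object>emptyMap());
  }
}

// Hypothetical composite: pushes its hints down onto every inner step.
final class WithHintsSketch {
  static <T> List<HintedStep<T>> expand(
      Map<String, Object> hints, List<HintedStep<T>> inner) {
    List<HintedStep<T>> out = new ArrayList<HintedStep<T>>();
    for (HintedStep<T> step : inner) {
      Map<String, Object> merged = new LinkedHashMap<String, Object>(hints);
      merged.putAll(step.hints); // a step's own hints override the composite's
      out.add(new HintedStep<T>(step.name, step.fn, merged));
    }
    return out;
  }

  // Runs the steps in order. Hints are deliberately never consulted here,
  // so adding or removing them cannot change the result.
  static <T> T run(List<HintedStep<T>> steps, T input) {
    T value = input;
    for (HintedStep<T> step : steps) {
      value = step.fn.apply(value);
    }
    return value;
  }

  public static void main(String[] args) {
    List<HintedStep<Integer>> inner = new ArrayList<HintedStep<Integer>>();
    inner.add(HintedStep.<Integer>of("double", x -> x * 2));
    inner.add(HintedStep.<Integer>of("increment", x -> x + 1));

    Map<String, Object> hints = new LinkedHashMap<String, Object>();
    hints.put("memory.mb", 4096); // illustrative key, not a real Beam hint
    List<HintedStep<Integer>> hinted = expand(hints, inner);

    System.out.println(hinted.get(0).hints);
    System.out.println(run(inner, 5) + " / " + run(hinted, 5));
  }
}
```

The step's own hints win over the composite's in this sketch; that precedence is a design choice made here for illustration and would need to be settled in the Jira.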


Thanks !
Regards
JB

On 01/30/2018 10:40 PM, Reuven Lax wrote:
> Did we actually reach consensus here? :)
> 
> On Tue, Jan 30, 2018 at 1:29 PM, Romain Manni-Bucau wrote:
> 
>     Not sure how it fits in terms of API yet but +1 for the high level view.
>     Makes perfect sense.
> 
>     On 30 Jan 2018 at 21:41, "Jean-Baptiste Onofré" wrote:
> 
>         Hi Robert,
> 
>         Good point and idea for the Composite transform. It would apply nicely
>         on all transforms based on composite.
> 
>         I also agree that the hint is more on the transform than the
>         PCollection itself.
> 
>         Thanks !
>         Regards
>         JB
> 
>         On 30/01/2018 21:26, Robert Bradshaw wrote:
> 
>             Many hints make more sense for PTransforms (the computation itself)
>             than for PCollections. In addition, when we want properties attached
>             to the PCollections themselves, it often makes sense to let these be
>             provided by the producing PTransform (e.g. coders and schemas are
>             often functions of the input metadata and the operation itself, and
>             can't just be set arbitrarily).
> 
>             Also, we already have a perfectly standard way of nesting transforms
>             (or even sets of transforms), namely composite transforms. In terms of
>             API design I would propose writing a composite transform that applies
>             constraints/hints/requirements to all its inner transforms. This
>             translates nicely to the Fn API as well.
> 
>             On Tue, Jan 30, 2018 at 12:14 PM, Kenneth Knowles wrote:
> 
>                 It seems like most of these use cases are hints on a PTransform
>                 and not a PCollection, no? CPU, memory, expected parallelism, etc.
>                 are. Then you could just have:
>                      pc.apply(WithHints(myTransform, <hints>))
> 
>                 For a PCollection, hints that might make sense are things like
>                 total size, element size, and throughput: all things that the
>                 Dataflow folks have said should be measured instead of hinted.
>                 But I understand that we shouldn't force runners to do infeasible
>                 things like build a whole no-knobs service on top of a
>                 super-knobby engine.
> 
>                 Incidentally, for portability we have this "environment" object
>                 that is basically the docker URL of an SDK harness that can
>                 execute a function. We always intended that same area of the
>                 proto (exact fields TBD) to have things like requirements for
>                 CPU, memory, GPUs, disk, etc. It is likely a good place for hints.
> 
>                 BTW, good idea to ask users@ for their pain points and bring them
>                 back to the dev list to motivate feature design discussions.
> 
>                 Kenn
> 
>                 On Tue, Jan 30, 2018 at 12:00 PM, Reuven Lax wrote:
> 
> 
>                     I think the hints would logically be metadata in the
>                     PCollection, just like coder and schema.
> 
>                     On Jan 30, 2018 11:57 AM, "Jean-Baptiste Onofré" wrote:
> 
> 
>                         Great idea for AddHints.of() !
> 
>                         What would be the resulting PCollection? Just a
>                         PCollection of hints, or the pc elements + hints?
> 
>                         Regards
>                         JB
> 
>                         On 30/01/2018 20:52, Reuven Lax wrote:
> 
> 
>                             I think adding hints for runners is reasonable, though
>                             hints should always be assumed to be optional: they
>                             shouldn't change the semantics of the program (otherwise
>                             you destroy the portability promise of Beam). However,
>                             there are many types of hints that some runners might
>                             find useful (e.g. this step needs more memory; this step
>                             runs ML algorithms and should run on a machine with
>                             GPUs; etc.)
> 
>                             Robert has mentioned in the past that we should try to
>                             keep PCollection an immutable object, and not introduce
>                             new setters on it. We slightly break this already today
>                             with PCollection.setCoder, and that has caused some
>                             problems. Hints can be set on PTransforms though, and
>                             propagate to that PTransform's output PCollections. This
>                             is nearly as easy to use, however, as we can implement a
>                             helper PTransform that can be used to set hints, i.e.
> 
>                             pc.apply(AddHints.of(hint1, hint2, hint3))
> 
>                             is no harder than calling pc.addHint().
> 
>                             Reuven
> 
>                             On Tue, Jan 30, 2018 at 11:39 AM, Jean-Baptiste Onofré wrote:
> 
>                                  Maybe I should have started the discussion on the
>                                  user mailing list: it would be great to have user
>                                  feedback on this, even if I got your points.
>
>                                  Sometimes I have the feeling that whatever we are
>                                  proposing and discussing doesn't go anywhere. At
>                                  some point, to attract more people, we have to get
>                                  ideas from different perspectives/standpoints.
> 
>                                  Thanks for the feedback anyway.
> 
>                                  Regards
>                                  JB
> 
>                                  On 30/01/2018 20:27, Romain Manni-Bucau wrote:
>
>                                      2018-01-30 19:52 GMT+01:00 Kenneth Knowles:
> 
> 
>                                           I generally like having certain "escape
>                                           hatches" that are well designed and limited
>                                           in scope, and anything that turns out to be
>                                           important becomes first-class. But this one
>                                           I don't really like because the use cases
>                                           belong elsewhere. Of course, they creep, so
>                                           you should assume they will be unbounded in
>                                           how much gets stuffed into them. And the
>                                           definition of a "hint" is that deleting it
>                                           does not change semantics, just
>                                           performance/monitoring/UI etc., but this
>                                           does not seem to be true.
>
>                                           "spark.persist" for idempotent replay in a
>                                           sink:
>                                             - this is already @RequiresStableInput
>                                             - it is not a hint because if you don't
>                                               persist your results are incorrect
>                                             - it is a property of a DoFn / transform,
>                                               not a PCollection
> 
> 
>                                      Let's put this last point aside, since we'll
>                                      manage to make it work wherever we store it ;).
> 
> 
>                                           schema:
>                                             - should be first-class
> 
> 
>                                      Except it doesn't make sense everywhere. It is
>                                      exactly like saying "implement this" and two
>                                      lines later "it doesn't do anything for you". If
>                                      you think more widely about schema you will want
>                                      to do far more (like getting them from the
>                                      previous step, etc.), which makes it not an API
>                                      thing. However, with some runners like Spark,
>                                      being able to specify it will allow the
>                                      execution to be optimized. There is a clear
>                                      mismatch between a consistent, user-friendly,
>                                      generic and portable API and a runtime,
>                                      runner-specific implementation.
>
>                                      This is all fine as an issue for a portable API,
>                                      and it is why all EE APIs have a map to pass
>                                      properties somewhere, so I don't see why Beam
>                                      wouldn't fall into that exact same bucket, since
>                                      it embraces the drawbacks of portability and we
>                                      have already been hitting it for several
>                                      releases.
> 
> 
>                                           step parallelism (you didn't mention it, but
>                                           most runners need some control):
>                                             - this is a property of the data and the
>                                               pipeline together, not just the pipeline
> 
> 
>                                      Good one, but this can be configured from the
>                                      pipeline or even a transform. This doesn't mean
>                                      the data is not important (and you are more than
>                                      right on that point), just that it is
>                                      configurable without referencing the data (using
>                                      ranges is a trivial example, even if not the most
>                                      efficient).
> 
> 
>                                           So I just don't actually see a use case for
>                                           free-form hints that we haven't already
>                                           covered.
> 
> 
>                                      There are several cases, even in the direct
>                                      runner, to be able to industrialize it:
>                                      - use that particular executor instance
>                                      - debug these infos for that transform
>
>                                      etc...
>
>                                      As a high-level design I think it is good to
>                                      bring hints to Beam, to avoid adding an ad-hoc
>                                      solution each time and taking the risk of losing
>                                      the portability of the main API.
> 
> 
>                                           Kenn
> 
>                                           On Tue, Jan 30, 2018 at 9:55 AM, Romain Manni-Bucau wrote:
> 
>                                               Lukasz, the point is that you have the
>                                               choice either to bring all specificities
>                                               into the main API, which makes most of
>                                               the API not usable or not implemented,
>                                               or the opposite: not support anything.
>                                               Introducing hints will allow some
>                                               runners to get some features early (or
>                                               just some very specific things), and
>                                               once mainstream a feature can find a
>                                               place in the main API. This is saner
>                                               than the opposite, since some
>                                               specificities can never find a good
>                                               place.
>
>                                               The little thing we need to take care of
>                                               with that is to avoid introducing some
>                                               feature flipping, i.e. supporting some
>                                               feature that is not doable with another
>                                               runner. It should really be about tuning
>                                               a runner's execution (like the schema in
>                                               Spark).
> 
> 
>                                               Romain Manni-Bucau
>                                               @rmannibucau <https://twitter.com/rmannibucau> | Blog
>                                               <https://rmannibucau.metawerx.net/> | Old Blog
>                                               <http://rmannibucau.wordpress.com> | Github
>                                               <https://github.com/rmannibucau> | LinkedIn
>                                               <https://www.linkedin.com/in/rmannibucau>
> 
>                                               2018-01-30 18:42 GMT+01:00 Jean-Baptiste Onofré:
> 
>                                                   Good point Luke: in that case, the
>                                                   hint will be ignored by the runner
>                                                   if the hint is not meant for it. The
>                                                   hint can be generic (not specific to
>                                                   a runner). It could be interesting
>                                                   for the schema support or IOs, not
>                                                   specific to a runner.
>
>                                                   What do you mean by gathering
>                                                   PTransforms/PCollections, and where?
> 
>                                                   Thanks !
>                                                   Regards
>                                                   JB
> 
>                                                   On 30/01/2018 18:35, Lukasz Cwik wrote:
> 
>                                                       If the hint is required to run
>                                                       the person's pipeline well, how
>                                                       do you expect that the person
>                                                       will be able to migrate their
>                                                       pipeline to another runner?
>
>                                                       A lot of hints like
>                                                       "spark.persist" are really the
>                                                       user trying to tell us something
>                                                       about the PCollection, like it
>                                                       is very small. I would prefer if
>                                                       we gathered this information
>                                                       about PTransforms and
>                                                       PCollections instead of
>                                                       runner-specific knobs, since
>                                                       then each runner can choose how
>                                                       best to map such a property onto
>                                                       its internal representation.
> 
>                                                       On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste Onofré wrote:
> 
>                                                            Hi,
> 
>                                                            As part of the discussion about
>                                                            schema, Romain mentioned hints. I
>                                                            think it's worth having an
>                                                            explanation about that, especially
>                                                            as it could be wider than schema.
> 
>                                                            Today, to give information to the
>                                                            runner, we use PipelineOptions. The
>                                                            runner can use these options and
>                                                            apply them to every inner
>                                                            representation of a PCollection in
>                                                            the runner.
> 
>                                                            For instance, for the Spark runner,
>                                                            the persistence storage level
>                                                            (memory, disk, ...) can be defined
>                                                            via pipeline options.
> 
>                                                            Then, the Spark runner automatically
>                                                            decides whether RDDs have to be
>                                                            persisted (using the storage level
>                                                            defined in the pipeline options),
>                                                            for instance if the same
>                                                            POutput/PCollection is read several
>                                                            times.
> 
>                                                            However, the user doesn't have any
>                                                            way to tell the runner how to deal
>                                                            with a specific PCollection.
> 
>                                                            Imagine the user has a pipeline like
>                                                            this:
>                                                            pipeline.apply().apply().apply().
>                                                            We have three PCollections involved
>                                                            in this pipeline. It's not currently
>                                                            possible to tell the runner how it
>                                                            should "optimize" and deal with the
>                                                            second PCollection only.
> 
>                                                            The idea is to add a method on the
>                                                            PCollection:
>
>                                                            PCollection.addHint(String key, Object value);
>
>                                                            For instance:
>
>                                                            collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);
> 
>                                                            I see three direct usages of this:
>
>                                                            1. Related to schema: the schema
>                                                            definition could be a hint.
>                                                            2. Related to the IO: add headers
>                                                            for the IO and tell the runner how
>                                                            to specifically process a
>                                                            collection. In Apache Camel, we have
>                                                            headers on the message and
>                                                            properties on the exchange, similar
>                                                            to this. They make it possible to
>                                                            give some indication of how to
>                                                            process some messages in the Camel
>                                                            component. We can imagine the same
>                                                            for the IO (using the PCollection
>                                                            hints to react accordingly).
>                                                            3. Related to runner optimization:
>                                                            I see, for instance, a way to use
>                                                            RDD or dataframe in the Spark
>                                                            runner, or even a specific
>                                                            optimization like persist. I had a
>                                                            lot of questions from Spark users
>                                                            saying: "in my Spark job, I know
>                                                            where and how I should use persist
>                                                            (rdd.persist()), but I can't do such
>                                                            optimization using Beam". So it
>                                                            could be a good improvement.
> 
>                                                            Thoughts ?
> 
>                                                            Regards
>                                                            JB
>                                                            --
>                                                            Jean-Baptiste Onofré
>                                                            [email protected]
>                                                            http://blog.nanthrax.net
>                                                            Talend - http://www.talend.com
> 
> 
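
The rule discussed in the thread above, that a runner ignores any hint not meant for it and that removing a hint must not change the result, could be sketched roughly as follows. This is plain Java and not a Beam API: HintAwareRunnerSketch and the "other.runner.knob" key are hypothetical names introduced only for illustration, and "spark.persist" is reused from the example earlier in the thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical runner model, not a Beam API: it knows a fixed set of hint
// keys and silently ignores every other key it is handed.
final class HintAwareRunnerSketch {
  private final Set<String> understoodKeys;

  HintAwareRunnerSketch(Set<String> understoodKeys) {
    this.understoodKeys = understoodKeys;
  }

  // Returns only the hints this runner understands, in their original order.
  Map<String, Object> applicableHints(Map<String, Object> hints) {
    Map<String, Object> kept = new LinkedHashMap<String, Object>();
    for (Map.Entry<String, Object> e : hints.entrySet()) {
      if (understoodKeys.contains(e.getKey())) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  // Sums the input. A persist-style hint may change HOW the work is done
  // (here: whether the input is first copied into a private buffer),
  // but it never changes the answer.
  int sum(List<Integer> input, Map<String, Object> hints) {
    boolean persist = applicableHints(hints).containsKey("spark.persist");
    List<Integer> source = persist ? new ArrayList<Integer>(input) : input;
    int total = 0;
    for (int value : source) {
      total += value;
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Object> hints = new LinkedHashMap<String, Object>();
    hints.put("spark.persist", "MEMORY_ONLY");
    hints.put("other.runner.knob", 4); // illustrative key no runner here knows

    HintAwareRunnerSketch sparkLike =
        new HintAwareRunnerSketch(Collections.singleton("spark.persist"));
    HintAwareRunnerSketch plain =
        new HintAwareRunnerSketch(Collections.<String>emptySet());

    List<Integer> data = new ArrayList<Integer>();
    data.add(1);
    data.add(2);
    data.add(3);

    System.out.println(sparkLike.applicableHints(hints));
    System.out.println(plain.applicableHints(hints));
    System.out.println(sparkLike.sum(data, hints) + " / " + plain.sum(data, hints));
  }
}
```

Both runners return the same sum for the same input; only the set of hints each one acts on differs, which is the portability property the thread asks for.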

-- 
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
