Hmm, it can work for pipeline hints, but for transform hints we would need:

p.apply(AddHint.of(.....).wrap(originalTransform))
Would work for me too.

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-30 20:57 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> Great idea for AddHints.of()!
>
> What would be the resulting PCollection? Just a PCollection of hints, or the pc elements plus the hints?
>
> Regards
> JB
>
> On 30/01/2018 20:52, Reuven Lax wrote:
>
>> I think adding hints for runners is reasonable, though hints should always be assumed to be optional - they shouldn't change the semantics of the program (otherwise you destroy the portability promise of Beam). However, there are many types of hints that some runners might find useful (e.g. this step needs more memory; this step runs ML algorithms and should run on a machine with GPUs; etc.).
>>
>> Robert has mentioned in the past that we should try to keep PCollection an immutable object and not introduce new setters on it. We already break this slightly today with PCollection.setCoder, and that has caused some problems. Hints can be set on PTransforms, though, and propagate to that PTransform's output PCollections. This is nearly as easy to use, however, since we can implement a helper PTransform for setting hints. I.e.
>>
>> pc.apply(AddHints.of(hint1, hint2, hint3))
>>
>> is no harder than calling pc.addHint().
>>
>> Reuven
>>
>> On Tue, Jan 30, 2018 at 11:39 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Maybe I should have started the discussion on the user mailing list: it would be great to have user feedback on this, even if I got your points.
>>>
>>> Sometimes I have the feeling that whatever we are proposing and discussing, it doesn't go anywhere.
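As an aside, the pc.apply(AddHints.of(hint1, hint2, hint3)) shape from the thread could look roughly like the following. This is a minimal, self-contained sketch, not real Beam code: Hints, FakePCollection, and AddHints are hypothetical stand-ins, used only to show how hints set on a helper transform could propagate to the output collection without mutating the input (keeping the collection effectively immutable).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Immutable bag of hints; a copy is taken so callers cannot mutate it later.
final class Hints {
    private final Map<String, Object> values;

    Hints(Map<String, Object> values) {
        this.values = Collections.unmodifiableMap(new HashMap<>(values));
    }

    Object get(String key) { return values.get(key); }
    Map<String, Object> asMap() { return values; }
}

// Stand-in for a PCollection: just a name plus attached hints.
final class FakePCollection {
    final String name;
    final Hints hints;

    FakePCollection(String name, Hints hints) {
        this.name = name;
        this.hints = hints;
    }

    // pc.apply(AddHints.of(...)) returns a NEW collection carrying the merged
    // hints; the input collection is left untouched.
    FakePCollection apply(AddHints transform) {
        Map<String, Object> merged = new HashMap<>(hints.asMap());
        merged.putAll(transform.hints.asMap());
        return new FakePCollection(name, new Hints(merged));
    }
}

// The hypothetical helper transform: holds hints to propagate to its output.
final class AddHints {
    final Hints hints;

    private AddHints(Hints hints) { this.hints = hints; }

    static AddHints of(String key, Object value) {
        Map<String, Object> m = new HashMap<>();
        m.put(key, value);
        return new AddHints(new Hints(m));
    }
}
```

Usage would be e.g. pc.apply(AddHints.of("spark.persist", "MEMORY_ONLY")); the returned collection carries the hint while the original stays unchanged, which is the immutability property discussed above.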
>>> At some point, to attract more people, we have to get ideas from different perspectives/standpoints.
>>>
>>> Thanks for the feedback anyway.
>>>
>>> Regards
>>> JB
>>>
>>> On 30/01/2018 20:27, Romain Manni-Bucau wrote:
>>>
>>> 2018-01-30 19:52 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>
>>>> I generally like having certain "escape hatches" that are well designed and limited in scope, where anything that turns out to be important becomes first-class. But this one I don't really like, because the use cases belong elsewhere. Of course, they creep, so you should assume they will be unbounded in how much gets stuffed into them. And the definition of a "hint" is that deleting it does not change semantics, just performance/monitoring/UI etc., but this does not seem to be true here.
>>>>
>>>> "spark.persist" for idempotent replay in a sink:
>>>> - this is already @RequiresStableInput
>>>> - it is not a hint, because if you don't persist your results are incorrect
>>>> - it is a property of a DoFn/transform, not of a PCollection
>>>
>>> Let's put this last point aside, since we'll manage to make it work wherever we store it ;).
>>>
>>>> schema:
>>>> - should be first-class
>>>
>>> Except it doesn't make sense everywhere. It is exactly like saying "implement this" and two lines later "it doesn't do anything for you". If you think wider about schemas you will want to do far more - like getting them from the previous step, etc. - which makes it not an API thing. However, with a runner like Spark, being able to specify a schema will enable optimizing the execution. There is a clear mismatch between a consistent, user-friendly, generic and portable API and a runtime, runner-specific implementation.
>>> This is all fine as an issue for a portable API, and it is why all the EE APIs have a map to pass properties somewhere, so I don't see why Beam wouldn't fall into that exact same bucket: it embraces the drawbacks of portability, and we have already hit this over several releases.
>>>
>>>> step parallelism (you didn't mention it, but most runners need some control):
>>>> - this is a property of the data and the pipeline together, not just the pipeline
>>>
>>> Good one, but this can be configured from the pipeline or even a transform. This doesn't mean the data is not important - and you are more than right on that point - just that it is configurable without referencing the data (using ranges is a trivial example, even if not the most efficient one).
>>>
>>>> So I just don't actually see a use case for free-form hints that we haven't already covered.
>>>
>>> There are several cases, even in the direct runner, needed to industrialize it:
>>> - use that particular executor instance
>>> - debug these infos for that transform
>>> etc.
>>>
>>> As a high-level design, I think it is good to bring hints to Beam, to avoid adding an ad-hoc solution each time and taking the risk of losing the portability of the main API.
>>>
>>>> Kenn
>>>>
>>>> On Tue, Jan 30, 2018 at 9:55 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>
>>>> Lukasz, the point is that you have the choice to either bring all specificities into the main API - which makes most of the API unusable or unimplemented - or the opposite, to not support anything. Introducing hints will allow some runners to have some features eagerly - or just some very specific things - and once a feature is mainstream it can find a place in the main API. This is saner than the opposite, since some specificities can never find a good place.
>>>> The little thing we need to take care of here is to avoid introducing feature flipping, i.e. supporting some feature that is not doable with another runner. It should really be about tuning a runner's execution (like the schema in Spark).
>>>>
>>>> Romain Manni-Bucau
>>>>
>>>> 2018-01-30 18:42 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>>>
>>>> Good point Luke: in that case, the hint will be ignored by the runner if the hint is not for it. The hint can also be generic (not specific to a runner). It could be interesting for the schema support or the IOs, not specific to a runner.
>>>>
>>>> What do you mean by gathering PTransforms/PCollections, and where?
>>>>
>>>> Thanks!
>>>> Regards
>>>> JB
>>>>
>>>> On 30/01/2018 18:35, Lukasz Cwik wrote:
>>>>
>>>> If the hint is required to run the person's pipeline well, how do you expect that the person will be able to migrate their pipeline to another runner?
>>>>
>>>> A lot of hints like "spark.persist" are really the user trying to tell us something about the PCollection, like that it is very small. I would prefer if we gathered this information about PTransforms and PCollections instead of using runner-specific knobs, since then each runner can choose how best to map such a property onto its internal representation.
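The point that a runner simply ignores hints that are not addressed to it suggests namespaced hint keys. Here is a minimal sketch under that assumption; RunnerHintFilter is a hypothetical name, not a Beam API. Each runner extracts only the entries under its own prefix and silently drops the rest, so the same pipeline stays runnable everywhere.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical runner-side filter: given the full free-form hint map, keep
// only the hints in this runner's namespace (e.g. "spark."), stripping the
// prefix so the runner works with its own vocabulary. Unknown hints from
// other runners are ignored, preserving portability.
final class RunnerHintFilter {

    static Map<String, Object> forRunner(String runnerPrefix, Map<String, Object> hints) {
        Map<String, Object> own = new HashMap<>();
        for (Map.Entry<String, Object> e : hints.entrySet()) {
            if (e.getKey().startsWith(runnerPrefix + ".")) {
                own.put(e.getKey().substring(runnerPrefix.length() + 1), e.getValue());
            }
        }
        return own;
    }
}
```

With this convention, a pipeline carrying both "spark.persist" and "flink.parallelism" runs unchanged on either runner: the Spark runner sees only persist, the Flink runner only parallelism, and the direct runner sees nothing at all.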
>>>> On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>
>>>> Hi,
>>>>
>>>> As part of the discussion about schemas, Romain mentioned hints. I think it's worth having an explanation about that, especially since it could be wider than schemas.
>>>>
>>>> Today, to give information to the runner, we use PipelineOptions. The runner can use these options and apply them to all the inner representations of the PCollections in the runner.
>>>>
>>>> For instance, for the Spark runner, the persistence storage level (memory, disk, ...) can be defined via pipeline options.
>>>>
>>>> Then, the Spark runner automatically decides whether RDDs have to be persisted (using the storage level defined in the pipeline options), for instance if the same POutput/PCollection is read several times.
>>>>
>>>> However, the user doesn't have any way to give the runner an indication of how to deal with a specific PCollection.
>>>>
>>>> Imagine the user has a pipeline like this: pipeline.apply().apply().apply(). We have three PCollections involved in this pipeline. It's not currently possible to give indications of how the runner should "optimize" and deal with the second PCollection only.
>>>>
>>>> The idea is to add a method on the PCollection:
>>>>
>>>> PCollection.addHint(String key, Object value);
>>>>
>>>> For instance:
>>>>
>>>> collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);
>>>>
>>>> I see three direct usages of this:
>>>>
>>>> 1. Related to schemas: the schema definition could be a hint.
>>>> 2. Related to the IOs: add headers for the IO and the runner on how to specifically process a collection.
>>>> In Apache Camel, we have headers on the message and properties on the exchange, similar to this. It allows giving the Camel component some indication of how to process some messages. We can imagine the same for the IOs (using the PCollection hints to react accordingly).
>>>> 3. Related to runner optimizations: I see, for instance, a way to use RDDs or dataframes in the Spark runner, or even specific optimizations like persist. I had lots of questions from Spark users saying: "in my Spark job, I know where and how I should use persist (rdd.persist()), but I can't do such an optimization using Beam". So it could be a good improvement.
>>>>
>>>> Thoughts?
>>>>
>>>> Regards
>>>> JB
>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbono...@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
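To make the original addHint proposal concrete, here is a hedged sketch. HintedCollection and MockSparkRunner are hypothetical names (none of this exists in Beam); the point is only that a per-collection hint lets a runner decide persistence for the one collection that is read several times, instead of relying on a single global pipeline option.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a PCollection carrying the proposed addHint(String, Object).
final class HintedCollection {
    final String name;
    private final Map<String, Object> hints = new HashMap<>();

    HintedCollection(String name) { this.name = name; }

    // The proposed PCollection.addHint(String key, Object value), returning
    // `this` so hints can be chained fluently at construction time.
    HintedCollection addHint(String key, Object value) {
        hints.put(key, value);
        return this;
    }

    Object hint(String key) { return hints.get(key); }
}

// Mock of the runner-side decision: persist per collection, not globally.
final class MockSparkRunner {
    // Returns the names of the collections this runner would persist,
    // by inspecting each collection's own hints.
    static List<String> collectionsToPersist(List<HintedCollection> pipeline) {
        List<String> persisted = new ArrayList<>();
        for (HintedCollection pc : pipeline) {
            if (pc.hint("spark.persist") != null) {
                persisted.add(pc.name);
            }
        }
        return persisted;
    }
}
```

In the three-stage pipeline.apply().apply().apply() example from the proposal, only the middle collection would carry the "spark.persist" hint, and the runner would persist exactly that one.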