If the hint is required to run the person's pipeline well, how do you expect that the person will be able to migrate their pipeline to another runner?
A lot of hints like "spark.persist" are really the user trying to tell us something about the PCollection, such as that it is very small. I would prefer that we gather this information about PTransforms and PCollections instead of exposing runner-specific knobs, since then each runner can choose how best to map such a property onto its internal representation (sketches of both approaches follow the quoted message below).

On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi,
>
> As part of the discussion about schemas, Romain mentioned hints. I think
> it's worth having an explanation of that, especially since it could be
> wider than schemas.
>
> Today, to give information to the runner, we use PipelineOptions. The
> runner can use these options and applies them to all inner representations
> of the PCollections in the runner.
>
> For instance, for the Spark runner, the persistence storage level (memory,
> disk, ...) can be defined via pipeline options.
>
> Then, the Spark runner automatically decides whether RDDs have to be
> persisted (using the storage level defined in the pipeline options), for
> instance if the same POutput/PCollection is read several times.
>
> However, the user doesn't have any way to indicate to the runner how to
> deal with a specific PCollection.
>
> Imagine the user has a pipeline like this: pipeline.apply().apply().apply().
> We have three PCollections involved in this pipeline. It's not currently
> possible to indicate how the runner should "optimize" and deal with the
> second PCollection only.
>
> The idea is to add a method on the PCollection:
>
> PCollection.addHint(String key, Object value);
>
> For instance:
>
> collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);
>
> I see three direct usages of this:
>
> 1. Related to schemas: the schema definition could be a hint.
> 2. Related to the IOs: add headers telling the IO and the runner how to
> specifically process a collection. In Apache Camel, we have headers on the
> message and properties on the exchange, similar to this. They let you give
> some indication of how to process certain messages in a Camel component.
> We can imagine the same for the IOs (using the PCollection hints to react
> accordingly).
> 3. Related to runner optimization: I see, for instance, a way to choose
> between RDDs and dataframes in the Spark runner, or even specific
> optimizations like persist. I have had a lot of questions from Spark users
> saying: "in my Spark job, I know where and how I should use persist
> (rdd.persist()), but I can't do such an optimization using Beam". So it
> could be a good improvement.
>
> Thoughts?
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
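To ground the discussion, here is roughly what the pipeline-wide knob JB describes looks like today, assuming the Spark runner's SparkPipelineOptions (which exposes a storageLevel option). Note that it applies to the whole job; there is no per-PCollection control:

    import org.apache.beam.runners.spark.SparkPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class StorageLevelKnob {
      public static void main(String[] args) {
        // One storage level for every RDD the Spark runner decides to
        // persist; there is no way to target the second PCollection only.
        SparkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
        options.setStorageLevel("MEMORY_AND_DISK");

        Pipeline pipeline = Pipeline.create(options);
        // pipeline.apply(...).apply(...).apply(...);
        pipeline.run().waitUntilFinish();
      }
    }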
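Next, a minimal sketch of the free-form hint bag proposed in the quoted message. Beam's PCollection has no addHint today, so the wrapper class and its methods below are hypothetical:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical wrapper carrying the proposed key/value hints alongside
    // a collection. A runner could inspect the hints during translation and
    // is free to ignore any key it does not understand.
    public class HintedCollection<T> {
      private final Map<String, Object> hints = new HashMap<>();

      public HintedCollection<T> addHint(String key, Object value) {
        hints.put(key, value);
        return this;
      }

      public Map<String, Object> getHints() {
        return hints;
      }
    }

A Spark runner could then look for a "spark.persist" entry and call persist() with the given StorageLevel on the RDD backing that collection, while other runners would simply skip the key.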
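For contrast, a sketch of the runner-agnostic alternative from the top of this message: the user declares a property of the data rather than turning a runner knob, and each runner maps the property onto its own mechanism. The names are again hypothetical:

    // Hypothetical properties of a PCollection, as opposed to knobs like
    // "spark.persist". A Spark runner might translate READ_MULTIPLE_TIMES
    // into rdd.persist(), another runner might materialize the collection
    // once, and a third might do nothing at all.
    public enum CollectionProperty {
      // Small enough to fit in memory on a single worker, e.g. a good
      // candidate for a broadcast/side input.
      IS_SMALL,
      // Consumed by more than one downstream transform, so recomputing it
      // is wasteful.
      READ_MULTIPLE_TIMES
    }

Usage might look like collection.withProperty(CollectionProperty.READ_MULTIPLE_TIMES), which keeps the pipeline portable across runners.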