If we implement MetricsPusher to run from the local job-submission JVM for
Dataflow jobs and that JVM dies, the Dataflow job would continue to
completion, but the MetricsPusher would not restart, so the exported
metrics would go stale.

I like that the Spark implementation has the ability to elect and manage a
driver worker in the cluster. Dataflow doesn't have such functionality,
although it would be useful. One way to emulate it would be Reuven's
suggestion to add a synthetic ParDo + timer to the graph.
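
To make that concrete, here is a rough, hypothetical sketch of what such a
synthetic transform could look like in the Java SDK. The metrics lookup and
sink wiring are left as placeholders since those are exactly the open
questions; the point is just the looping processing-time timer:

  import org.apache.beam.sdk.state.TimeDomain;
  import org.apache.beam.sdk.state.Timer;
  import org.apache.beam.sdk.state.TimerSpec;
  import org.apache.beam.sdk.state.TimerSpecs;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;
  import org.joda.time.Duration;

  /**
   * Hypothetical transform the runner could append during graph translation,
   * fed by a single keyed dummy element. A looping processing-time timer
   * periodically pushes a metrics snapshot to the configured sink.
   */
  class MetricsPushFn extends DoFn<KV<Void, Void>, Void> {
    private static final Duration PERIOD = Duration.standardSeconds(5);

    @TimerId("push")
    private final TimerSpec pushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

    @ProcessElement
    public void start(@TimerId("push") Timer timer) {
      // Arm the first firing; each firing re-arms itself below.
      timer.offset(PERIOD).setRelative();
    }

    @OnTimer("push")
    public void onPush(@TimerId("push") Timer timer) {
      // Placeholder: query the runner's current metric values and hand them
      // to the configured sink. How that lookup works is runner-specific.
      // pushCurrentSnapshotToSink();
      timer.offset(PERIOD).setRelative();  // keep looping
    }
  }

The obvious downside is that this extra step shows up in the job graph,
which is the concern Etienne raises below about the DAG UI.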

On Thu, Oct 4, 2018 at 5:28 AM Etienne Chauchot <[email protected]>
wrote:

> Well, if we add the code in DataflowPipelineJob, then it would issue REST
> calls to the engine, no? And it would also be far from the actual execution
> engine. Isn't that what we wanted to avoid?
>
> Regarding the driver, to draw an analogy with Spark (even though Spark is
> not a cloud-hosted runner, and its driver is close to the execution
> engine): the MetricsPusher polling thread indeed lives inside the driver
> program context. This program can be hosted either in the submission
> client process (when the pipeline is submitted in Spark client mode) or in
> an elected driver worker in the cluster (when the pipeline is submitted in
> Spark cluster mode). In the latter case, the cluster manager, such as YARN
> or Mesos, can relaunch the driver program if the worker dies.
>
> Etienne
>
> On Wednesday, October 3, 2018 at 16:49 -0700, Scott Wegner wrote:
>
> Another point that we discussed at ApacheCon is that a difference between
> Dataflow and other runners is that Dataflow is service-based and doesn't
> need a locally executing "driver" program. A local driver context is a good
> place to implement MetricsPusher because it is a singleton process.
>
> In fact, DataflowRunner supports PipelineResult.waitUntilFinish() [1],
> where we do maintain the local JVM context. Currently in this mode the
> runner polls the Dataflow service API for log messages [2]. It would be
> very easy to also poll for metric updates and push them out via
> MetricsPusher.
>
> [1]
> https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L169
> [2]
> https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L291
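>
> To illustrate, here is a rough sketch of what that polling loop could look
> like, using the existing PipelineResult.metrics() API. The Consumer stands
> in for whatever sink MetricsPusher would be configured with; none of this
> is actual MetricsPusher wiring:
>
>   import java.util.concurrent.Executors;
>   import java.util.concurrent.ScheduledExecutorService;
>   import java.util.concurrent.TimeUnit;
>   import java.util.function.Consumer;
>   import org.apache.beam.sdk.PipelineResult;
>   import org.apache.beam.sdk.metrics.MetricQueryResults;
>   import org.apache.beam.sdk.metrics.MetricsFilter;
>
>   class MetricsPollingLoop {
>     // Poll the job's metrics alongside waitUntilFinish() and hand each
>     // snapshot to the configured sink.
>     static void pushWhileRunning(PipelineResult job, Consumer<MetricQueryResults> sink) {
>       ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
>       executor.scheduleAtFixedRate(
>           () -> {
>             try {
>               MetricQueryResults snapshot =
>                   job.metrics().queryMetrics(MetricsFilter.builder().build());
>               sink.accept(snapshot);  // push to the external backend
>             } catch (RuntimeException e) {
>               // Swallow so one failed poll doesn't cancel the schedule.
>             }
>           },
>           0, 5, TimeUnit.SECONDS);
>       job.waitUntilFinish();
>       executor.shutdownNow();
>     }
>   }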
>
> On Wed, Oct 3, 2018 at 4:44 AM Etienne Chauchot <[email protected]>
> wrote:
>
> Hi Scott,
> Thanks for the update.
> Both solutions look good to me, though each has pros and cons. I'll let
> the Googlers choose which is more appropriate:
>
> - DAG modification: less intrusive in Dataflow, but the DAG executed and
> shown in the Dataflow UI will contain an extra step that the user might
> wonder about.
> - Polling thread: it is exactly what I did for the other runners and is
> more transparent to the user, but it requires more infra work (it adds a
> container that needs to be resilient).
>
> Best
> Etienne
>
> On Friday, September 21, 2018 at 12:46 -0700, Scott Wegner wrote:
>
> Hi Etienne, sorry for the delay on this. I just got back from leave and
> found this discussion.
>
> We haven't started implementing MetricsPusher in the Dataflow runner,
> mostly because the Dataflow service has its own rich Metrics REST API and
> we haven't heard a need from Dataflow customers to push metrics to an
> external backend. However, it would be nice to have this implemented across
> all runners for feature parity.
>
> I read through the discussion in JIRA [1], and the simplest implementation
> for Dataflow may be to have a single thread periodically poll the Dataflow
> REST API [2] for the latest metric values and push them to a configured
> sink. This polling thread could be hosted in a separate Docker container,
> within the worker process, or perhaps be a ParDo with timers that gets
> injected into the pipeline during graph translation.
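>
> Purely as an illustration, here is a rough sketch of one such poll using
> the generated Dataflow API client against the getMetrics method [2]. The
> already-authenticated client and the pushToSink stub are placeholders,
> not an agreed design:
>
>   import java.io.IOException;
>   import java.util.List;
>   import com.google.api.services.dataflow.Dataflow;
>   import com.google.api.services.dataflow.model.JobMetrics;
>   import com.google.api.services.dataflow.model.MetricUpdate;
>
>   class GetMetricsPoller {
>     // One poll of projects.locations.jobs.getMetrics for a running job.
>     static void pollOnce(Dataflow client, String project, String region, String jobId)
>         throws IOException {
>       JobMetrics response =
>           client.projects().locations().jobs()
>               .getMetrics(project, region, jobId)
>               .execute();
>       List<MetricUpdate> updates = response.getMetrics();
>       if (updates != null) {
>         pushToSink(updates);
>       }
>     }
>
>     // Placeholder for translating MetricUpdates and pushing them to the
>     // configured backend.
>     static void pushToSink(List<MetricUpdate> updates) {}
>   }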
>
> At any rate, I'm not aware of anybody currently working on this. But with
> the Dataflow worker code being donated to Beam [3], soon it will be
> possible for anybody to contribute.
>
> [1] https://issues.apache.org/jira/browse/BEAM-3926
> [2]
> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/getMetrics
> [3]
> https://lists.apache.org/thread.html/2bdc645659e2fbd7e29f3a2758941faefedb01148a2a11558dfe60f8@%3Cdev.beam.apache.org%3E
>
> On Fri, Aug 17, 2018 at 4:26 PM Lukasz Cwik <[email protected]> wrote:
>
> I forwarded your request to a few people who work on the internal parts of
> Dataflow to see if they could help in some way.
>
> On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot <[email protected]>
> wrote:
>
> Hi all
>
> As we already discussed, it would be good to support Metrics Pusher [1] in
> Dataflow (in other runners also, of course). Today, only Spark and Flink
> support it. It requires a modification to the C++ Dataflow code, so only our
> Google friends can do it.
>
> Is someone interested in doing it?
>
> Here is the ticket: https://issues.apache.org/jira/browse/BEAM-3926
>
> Besides, I wonder if this feature should be added to the capability matrix.
>
> [1]
> https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
>
> Thanks
> Etienne
>

-- 
Got feedback? tinyurl.com/swegner-feedback