Help on Apache Beam Pipeline Optimization

Yi En Ong Tue, 04 Oct 2022 11:15:01 -0700

Hi,

I am trying to optimize my Apache Beam pipeline on Google Cloud Platform
Dataflow, and I would really appreciate your help and advice.

Background information: I am trying to read data from PubSub Messages, and
aggregate them based on 3 time windows: 1 min, 5 min and 60 min. Such
aggregations consists of summing, averaging, finding the maximum or
minimum, etc. For example, for all data collected from 1200 to 1201, I want
to aggregate them and write the output into BigTable's 1-min column family.
And for all data collected from 1200 to 1205, I want to similarly aggregate
them and write the output into BigTable's 5-min column. Same goes for 60min.

The current approach I took is to have 3 separate dataflow jobs (i.e. 3
separate Beam Pipelines), each one having a different window duration
(1min, 5min and 60min). See
https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html.
And the outputs of all 3 dataflow jobs are written to the same BigTable,
but on different column families. Other than that, the function and
aggregations of the data are the same for the 3 jobs.

However, this seems to be very computationally inefficient, and cost
inefficient, as the 3 jobs are essentially doing the same function, with
the only exception being the window time duration and output column family.

Some challenges and limitations we faced was that from the documentation,
it seems like we are unable to create multiple windows of different periods
in a singular dataflow job. Also, when we write the final data into big
table, we would have to define the table, column family, column, and
rowkey. And unfortunately, the column family is a fixed property (i.e. it
cannot be redefined or changed given the window period).

Hence, I am writing to ask if there is a way to only use 1 dataflow job
that fulfils the objective of this project? Which is to aggregate data on
different window periods, and write them to different column families of
the same BigTable.

Thank you

Help on Apache Beam Pipeline Optimization

Reply via email to