Hi,
I am trying to optimize my Apache Beam pipeline on Google Cloud Platform Dataflow, and I would really appreciate your help and advice. Background information: I am trying to read data from PubSub Messages, and aggregate them based on 3 time windows: 1 min, 5 min and 60 min. Such aggregations consists of summing, averaging, finding the maximum or minimum, etc. For example, for all data collected from 1200 to 1201, I want to aggregate them and write the output into BigTable's 1-min column family. And for all data collected from 1200 to 1205, I want to similarly aggregate them and write the output into BigTable's 5-min column. Same goes for 60min. The current approach I took is to have 3 separate dataflow jobs (i.e. 3 separate Beam Pipelines), each one having a different window duration (1min, 5min and 60min). See https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html. And the outputs of all 3 dataflow jobs are written to the same BigTable, but on different column families. Other than that, the function and aggregations of the data are the same for the 3 jobs. However, this seems to be very computationally inefficient, and cost inefficient, as the 3 jobs are essentially doing the same function, with the only exception being the window time duration and output column family. Some challenges and limitations we faced was that from the documentation, it seems like we are unable to create multiple windows of different periods in a singular dataflow job. Also, when we write the final data into big table, we would have to define the table, column family, column, and rowkey. And unfortunately, the column family is a fixed property (i.e. it cannot be redefined or changed given the window period). Hence, I am writing to ask if there is a way to only use 1 dataflow job that fulfils the objective of this project? Which is to aggregate data on different window periods, and write them to different column families of the same BigTable. Thank you