I think having the sink output something is a good option. My use case looks something like this:

1) Read a bunch of Avro files (all with the same schema, but the schema is not known beforehand)
2) Use the Avro schema + data to generate Bigtable mutations
3) Write the mutations
4) *Once all files are processed and written* -> update a "checkpoint" marker in another Bigtable table, which also depends on the schema from (1)

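As a rough illustration of the ordering described in steps 1-4, here is a minimal plain-Python sketch. It is not Beam code and does not use any Bigtable or Avro API: the two tables are plain dicts, the file reader returns made-up records, and every function name (`read_avro_file`, `generate_mutations`, `write_mutations`, `run`) is hypothetical. The only point it shows is that the write step returns a value, and the checkpoint update waits on that value from every file.

```python
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical in-memory stand-ins for the two Bigtable tables.
data_table = {}
checkpoint_table = {}


def read_avro_file(name):
    """Step 1 (stand-in): return (schema, records) for one file."""
    schema = ("id", "value")  # assumed: every file shares this schema
    records = [{"id": f"{name}-{i}", "value": i} for i in range(3)]
    return schema, records


def generate_mutations(schema, records):
    """Step 2: turn schema + data into (row_key, row) mutations."""
    key_field = schema[0]
    return [(record[key_field], record) for record in records]


def write_mutations(mutations):
    """Step 3: the 'sink'. Returning the row count gives later steps
    something to wait on."""
    for row_key, row in mutations:
        data_table[row_key] = row
    return len(mutations)


def process_file(name):
    """Run steps 1-3 for a single file."""
    schema, records = read_avro_file(name)
    written = write_mutations(generate_mutations(schema, records))
    return schema, written


def run(files):
    """Process all files, then update the checkpoint (step 4)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(process_file, name) for name in files]
        # Barrier: block until every write has completed.
        wait(futures)
    results = [future.result() for future in futures]
    if not results:
        return None
    # Step 4: runs only after all writes are done, and uses the schema
    # obtained in step 1.
    checkpoint_table["checkpoint"] = {
        "schema": results[0][0],
        "rows_written": sum(written for _, written in results),
    }
    return checkpoint_table["checkpoint"]
```

Calling `run(["a.avro", "b.avro"])` writes the rows for both files and only then records the checkpoint. In a real pipeline the barrier would have to come from the runner, which is exactly what a sink that emits an output would make possible.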
I've hacked this up by making the flow in step 4 rely on an output from step 2 that goes through a GroupBy, to ensure that all records have at least been processed by step 2 before step 4 runs. But there's still a race condition between the last record being emitted by step 2 and the write in step 3 completing. If, as you said, the sink emitted a record when it completed, that would solve the race condition.

In summary, right now the flow looks like this (terrible ASCII attempt):

Read Avro Files (extract schema + data) (1)
          |
          V
Generate mutations (2)
          |----------------------------> [GroupBy -> Take first -> Generate mutation -> Bigtable write] (4)
          V
Write mutations (3)

On Fri, Aug 25, 2017 at 11:53 AM, Eugene Kirpichov <[email protected]> wrote:

> I'd like to know more about your both use cases, can you clarify? I think
> making sinks output something that can be waited on by another pipeline
> step is a reasonable request, but more details would help refine this
> suggestion.
>
> On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <[email protected]>
> wrote:
>
> > Can you do this from the program that runs the Beam job, after job is
> > complete (you might have to use a blocking runner or poll for the status
> > of the job) ?
> >
> > - Cham
> >
> > On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <[email protected]>
> > wrote:
> >
> > > I also have a similar use case (but with BigTable) that I feel like I
> > > had to hack up to make work. It'd be great to hear if there is a way
> > > to do something like this already, or if there are plans in the future.
> > >
> > > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > > I have a few pipelines that are an ETL from different systems to
> > > > bigquery.
> > > > I would like to write the status of the ETL after all records have
> > > > been updated to the bigquery.
> > > > The problem is that writing to bigquery is a sink and you cannot have
> > > > any other steps after the sink.
> > > > I tried a sideoutput, but this is called in no correlation to the
> > > > writing to bigquery, so i don't know if it succeeded or failed.
> > > >
> > > >
> > > > any ideas?
> > > > chaim
