Hi Chet,
It sounds like you want the following pattern:
- Write data in parallel
- Once all parallel writes have completed, gather their results and issue a
commit
The Sink API used to do something like that, but it turned out that the
only thing that mapped well onto that API was files; other things were
either much more complex than the API could express (BigQueryIO.write())
or much simpler (regular DoFn's). So we removed the API, because it was
useful only for one thing (files) and because it kept confusing people
into thinking they needed to use it ("I'm writing to a storage system, so
I probably should use the Sink API"), whereas in nearly 100% of cases
they didn't.
However, your use case maps well onto the original goal of Sink. I'd
recommend either looking at how WriteFiles and BigQueryIO.write() work (be
warned: they are very complex) or looking at how the original Sink API
worked. There was nothing special about it: it was just a composite
transform that anyone could write in their own pipeline without adding
anything to Beam itself. See the original code at 1.8.0:
https://github.com/apache/beam/blob/v1.8.0/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/Write.java#L290
Draw inspiration from it and implement your transform in a similar way :)
(though you'll be able to do it much more simply, because you're
implementing an individual case rather than a general-purpose API).
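
FWIW, here's a very rough sketch of what such a composite transform could
look like with the Java SDK. It's completely untested and purely
illustrative: the names (AtomicAlgoliaWrite, WriteToTempIndexFn,
BATCH_SIZE) and the commented-out Algolia client calls are placeholders
for your own code, not real APIs. The structure mirrors the old Write
transform: a writer ParDo whose output is gathered as a side input, plus
a single "commit" element that can only be processed after all of the
writes have finished:

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.PDone;

class AtomicAlgoliaWrite<T> extends PTransform<PCollection<T>, PDone> {
  private static final int BATCH_SIZE = 1000;  // whatever batch size suits you

  private final String tempIndexName;
  private final String finalIndexName;

  AtomicAlgoliaWrite(String tempIndexName, String finalIndexName) {
    this.tempIndexName = tempIndexName;
    this.finalIndexName = finalIndexName;
  }

  @Override
  public PDone expand(PCollection<T> input) {
    // Step 1: write everything to the temporary index in parallel. What the
    // writer emits doesn't matter much; what matters is that the commit step
    // below can only run once this PCollection has been fully produced.
    PCollection<Long> writeResults =
        input.apply("WriteToTempIndex", ParDo.of(new WriteToTempIndexFn<T>(tempIndexName)));

    // Step 2: gather the per-batch results as a side input.
    final PCollectionView<Iterable<Long>> resultsView =
        writeResults.apply(View.<Long>asIterable());

    // Step 3: a single-element collection drives exactly one commit; the side
    // input forces it to wait until all writes are done.
    input.getPipeline()
        .apply("CreateCommitTrigger", Create.of("commit"))
        .apply("MoveIndex", ParDo.of(new DoFn<String, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            Iterable<Long> batchCounts = c.sideInput(resultsView);
            // Optionally sanity-check batchCounts here, then atomically
            // promote the temporary index with the Algolia client's
            // "move index" operation (placeholder, check their API):
            // client.moveIndex(tempIndexName, finalIndexName);
          }
        }).withSideInputs(resultsView));

    return PDone.in(input.getPipeline());
  }

  // Batches elements and saves each batch to the temporary index.
  private static class WriteToTempIndexFn<T> extends DoFn<T, Long> {
    private final String tempIndexName;
    private transient List<T> batch;

    WriteToTempIndexFn(String tempIndexName) {
      this.tempIndexName = tempIndexName;
    }

    @StartBundle
    public void startBundle() {
      batch = new ArrayList<>();
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      batch.add(c.element());
      if (batch.size() >= BATCH_SIZE) {
        c.output((long) batch.size());
        flush();
      }
    }

    @FinishBundle
    public void finishBundle() {
      flush();
    }

    private void flush() {
      if (batch == null || batch.isEmpty()) {
        return;
      }
      // client.saveObjects(tempIndexName, batch);  // placeholder for the real Algolia call
      batch.clear();
    }
  }
}

Using it would then just be something like
records.apply(new AtomicAlgoliaWrite<MyRecord>("products_tmp", "products")),
with the element type and index names being whatever you need.
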
On Thu, Nov 16, 2017 at 5:28 PM Chet Aldrich <[email protected]>
wrote:
> Hello all,
>
> I’m in the process of implementing a way to write data using a PTransform
> to Algolia (https://www.algolia.com/).
> However, in the process of doing so I’ve run into a bit of a snag, and was
> curious if someone here would be able to help me figure this out.
>
> Building a DoFn that can accomplish this is straightforward, since it
> essentially just involves creating a bundle of values and then flushing the
> batch out to Algolia using their API client as needed.
>
> However, I’d like to perform the changes to the index atomically, that is,
> to either write all of the data or none of the data in the event of a
> pipeline failure. This can be done in Algolia by moving a temporary index
> on top of an existing one, like they do here:
> https://www.algolia.com/doc/tutorials/indexing/synchronization/atomic-reindexing/
>
> This is where it gets a bit more tricky. I noted that there exists a
> @Teardown annotation that allows one to do something like close the client
> when the DoFn is complete on a given machine, but it doesn’t quite do what
> I want.
>
> In theory, I’d like to write to a temporary index, and then, once the
> transform has been performed on all elements, move the index over,
> completing the operation.
>
> I previously implemented this functionality with the Beam Python SDK,
> using the Sink class described here:
> https://beam.apache.org/documentation/sdks/python-custom-io/
>
> I’m making the transition to the Java SDK because of the built-in JDBC I/O
> transform. However, I’m finding that the Sink API for Java is proving
> elusive, and digging around hasn’t proved fruitful. Specifically, I was
> looking at this page, and it seems to direct me to ask here if I’m not
> sure whether the functionality I desire can be implemented with a DoFn:
> https://beam.apache.org/documentation/io/authoring-overview/#when-to-implement-using-the-sink-api
>
> Is there anything that can do something similar to what I just described?
> If there’s something I just missed while digging through the DoFn
> documentation, that’d be great, but I didn’t see anything.
>
> Best,
>
> Chet
>