Hey JB, I wasn’t really thinking about open-sourcing the I/O transform with Beam in this case because of the API’s proprietary nature. But then again, I suppose BigQuery and other Google services are similarly proprietary and have transforms included with Beam.
If you feel like it’d be a worthwhile addition, I can ask some people who work there about the licensing and see if this is something that could be done.

Chet

> On Nov 16, 2017, at 9:08 PM, Jean-Baptiste Onofré <[email protected]> wrote:
>
> Hi,
>
> if you take a look at the existing IOs, most of them don't use the Sink API:
> they implement a Sink using a DoFn.
>
> I think Algolia would be the same for the Write. What do you think about
> updating the index when we finalize a bundle?
>
> NB: what's the Algolia "client/API" license? Just to double check that it
> can be part of Beam.
>
> Regards
> JB
>
> On 11/17/2017 02:28 AM, Chet Aldrich wrote:
>> Hello all,
>> I’m in the process of implementing a way to write data using a PTransform to
>> Algolia (https://www.algolia.com/). However, in the process of doing so I’ve
>> run into a bit of a snag, and was curious if someone here would be able to
>> help me figure this out.
>> Building a DoFn that can accomplish this is straightforward, since it
>> essentially just involves creating a bundle of values and then flushing the
>> batch out to Algolia using their API client as needed.
>> However, I’d like to perform the changes to the index atomically, that is,
>> to either write all of the data or none of the data in the event of a
>> pipeline failure. This can be done in Algolia by moving a temporary index on
>> top of an existing one, as they do here:
>> https://www.algolia.com/doc/tutorials/indexing/synchronization/atomic-reindexing/
>> This is where it gets a bit more tricky. I noted that there exists a
>> @Teardown annotation that allows one to do something like close the client
>> when the DoFn is complete on a given machine, but it doesn’t quite do what I
>> want.
>> In theory, I’d like to write to a temporary index, and then when the
>> transform has been performed on all elements, move the index over,
>> completing the operation.
>> I previously implemented this functionality in the Beam Python SDK using
>> the Sink class described here:
>> https://beam.apache.org/documentation/sdks/python-custom-io/
>> I’m making the transition to the Java SDK because of the built-in JDBC I/O
>> transform. However, I’m finding that this Sink API for Java is proving
>> elusive, and digging around hasn’t been fruitful. Specifically, I was
>> looking at this page, and it seems to direct me to ask here if I’m not sure
>> whether the functionality I want can be implemented with a DoFn:
>> https://beam.apache.org/documentation/io/authoring-overview/#when-to-implement-using-the-sink-api
>> Is there something that can do something similar to what I just described?
>> If there’s something I just missed while digging through the DoFn
>> documentation that’d be great, but I didn’t see anything.
>> Best,
>> Chet
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
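
P.S. To make sure I'm following the bundle-based suggestion, here's a rough sketch of the DoFn write I have in mind: batch elements in @ProcessElement and flush when the runner finalizes the bundle. AlgoliaWriteClient, MyRecord, and saveObjects are placeholders I made up for illustration, not the real Algolia Java client API.

    // Sketch only: batches records and flushes them when the bundle is finalized.
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;

    public class AlgoliaWriteFn extends DoFn<MyRecord, Void> {

      private static final int BATCH_SIZE = 1000;

      // Initialized per worker / per bundle, so they don't need to be serializable.
      private transient AlgoliaWriteClient client;
      private transient List<MyRecord> batch;

      @Setup
      public void setup() {
        // Placeholder: open a client pointed at the temporary index.
        client = new AlgoliaWriteClient(/* appId, apiKey, tempIndexName */);
      }

      @StartBundle
      public void startBundle() {
        batch = new ArrayList<>();
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        batch.add(c.element());
        if (batch.size() >= BATCH_SIZE) {
          flush();
        }
      }

      @FinishBundle
      public void finishBundle() {
        // Flush whatever is left when the bundle is finalized.
        flush();
      }

      @Teardown
      public void teardown() {
        client.close();
      }

      private void flush() {
        if (!batch.isEmpty()) {
          client.saveObjects(batch);  // placeholder for the real batch-write call
          batch.clear();
        }
      }
    }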
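For the atomic part, my current thinking is that the temporary-to-live index move would have to happen on the driver side, only after the pipeline has finished successfully, roughly like this (again, adminClient and moveIndex are placeholders rather than the real Algolia API):

    // Driver-side sketch: promote the temporary index only if the whole pipeline succeeded.
    // Uses org.apache.beam.sdk.PipelineResult from the Beam Java SDK.
    PipelineResult result = pipeline.run();
    PipelineResult.State state = result.waitUntilFinish();
    if (state == PipelineResult.State.DONE) {
      adminClient.moveIndex("records_tmp", "records");  // placeholder for the atomic move
    } else {
      // Pipeline failed or was cancelled: leave the live index untouched.
    }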
