I had not seen that Neptune's query API is Gremlin-based, so this could become an even more generic IO connector. That's probably beyond scope, since you care mostly about the write side, but it's interesting anyway.
https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-java.html

On Fri, Apr 16, 2021 at 9:58 AM Ismaël Mejía <[email protected]> wrote:
>
> Hello Gabriel,
>
> Another interesting reference, because of its similar batch-load API use and its Amazon focus, is the unfinished Amazon Redshift connector PR from this ticket:
> https://issues.apache.org/jira/browse/BEAM-3032
>
> The reason that one was not merged into Beam is that it lacked tests. You should probably look at how to test Neptune in advance; it seems that localstack does not support Neptune (only in the paying version), so mocking would probably be the right way.
>
> We would be really interested if you want to contribute the NeptuneIO connector into Beam, so don't hesitate to contact us.
>
>
> On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz <[email protected]> wrote:
> >
> > Hi Daniel, Kenneth,
> >
> > Thank you very much for your answers! I'll be looking carefully into the info you've provided, and if we eventually decide it's worth implementing, I'll get back to you.
> >
> > Best,
> > Gabriel
> >
> >
> > On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles <[email protected]> wrote:
> >>
> >> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins <[email protected]> wrote:
> >>>
> >>> Hi Gabriel,
> >>>
> >>> Write-side adapters for systems tend to be easier to implement than read-side adapters. That being said, looking at the documentation for Neptune, it looks to me like there's no direct data load API, only a batch data load from a file on S3? This is usable, but perhaps a bit more difficult to work with.
> >>>
> >>> You could implement a write-side adapter for Neptune (either on your own or as a contribution to Beam) by writing a standard DoFn which, in its ProcessElement method, buffers received records in memory, and in its FinishBundle method, writes all collected records to a file on S3, notifies Neptune, and waits for Neptune to ingest them. You can see documentation on the DoFn API here. Someone else here might have more experience working with microbatch-style APIs like this and could have more suggestions.
> >>
> >> In fact, our BigQueryIO connector has a mode of operation that does batch loads from files on GCS:
> >> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
> >>
> >> The connector overall is large and complex, because it is old and mature. But it may be helpful as a point of reference.
> >>
> >> Kenn
> >>
> >>> A read-side API would likely be only a minimally higher lift. This could be done in a simple loading step (Create with a single element followed by MapElements), although much of the complexity likely lies in how to provide the necessary properties for cluster construction on the Beam worker, and how to define the query the user would need to execute. I'd also wonder if this could be done in an engine-agnostic way, "TinkerPopIO" instead of "NeptuneIO".
> >>>
> >>> If you'd like to pursue adding such an integration, https://beam.apache.org/contribute/ provides documentation on the contribution process. Contributions to Beam are always appreciated!
> >>> -Daniel
> >>>
> >>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz <[email protected]> wrote:
> >>>>
> >>>> Dear Beam Dev community,
> >>>>
> >>>> I'm working on a project where we have a graph database on Amazon Neptune (https://aws.amazon.com/neptune) and we have data coming from Google Cloud.
> >>>>
> >>>> So I was wondering if anyone has ever worked with a similar architecture and has considered developing an Amazon Neptune custom Beam I/O connector. Is it feasible? Is it worth it?
> >>>>
> >>>> Honestly, I'm not that experienced with Apache Beam / Dataflow, so I'm not sure if something like that would make sense. Currently we're connecting Beam to AWS Kinesis and AWS S3, and from there, to Neptune.
> >>>>
> >>>> Thank you all very much in advance!
> >>>>
> >>>> Best,
> >>>> Gabriel Levcovitz
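
To make the write-side pattern Daniel describes above more concrete, here is a rough, untested Java sketch of such a DoFn: it buffers rows in memory per bundle, stages them as one object on S3 in FinishBundle, and then asks the Neptune bulk loader to ingest the file. The class name and the bucket, loader endpoint, and IAM role parameters are placeholders, and the sketch assumes the AWS SDK for Java v2 S3 client plus Neptune's bulk loader REST API (a POST to the /loader endpoint); it omits retries, size-based flushing, and polling of the load status.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.beam.sdk.transforms.DoFn;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;

/**
 * Hypothetical write adapter: buffers Neptune bulk-load CSV rows per bundle,
 * stages them on S3 in FinishBundle, then triggers a Neptune bulk load.
 */
class NeptuneBatchWriteFn extends DoFn<String, Void> {

  private final String bucket;          // S3 staging bucket (placeholder)
  private final String loaderEndpoint;  // e.g. "https://<cluster-endpoint>:8182/loader" (placeholder)
  private final String iamRoleArn;      // role Neptune assumes to read from S3 (placeholder)
  private final String region;

  private transient S3Client s3;
  private transient HttpClient http;
  private transient List<String> buffer;

  NeptuneBatchWriteFn(String bucket, String loaderEndpoint, String iamRoleArn, String region) {
    this.bucket = bucket;
    this.loaderEndpoint = loaderEndpoint;
    this.iamRoleArn = iamRoleArn;
    this.region = region;
  }

  @Setup
  public void setup() {
    s3 = S3Client.create();
    http = HttpClient.newHttpClient();
  }

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(@Element String csvRow) {
    buffer.add(csvRow); // collect this bundle's rows in memory
  }

  @FinishBundle
  public void finishBundle() throws Exception {
    if (buffer.isEmpty()) {
      return;
    }
    // 1. Stage the bundle's rows as a single CSV object on S3.
    String key = "neptune-staging/" + UUID.randomUUID() + ".csv";
    s3.putObject(b -> b.bucket(bucket).key(key), RequestBody.fromString(String.join("\n", buffer)));

    // 2. Ask the Neptune bulk loader to ingest that object.
    String payload = String.format(
        "{\"source\":\"s3://%s/%s\",\"format\":\"csv\",\"iamRoleArn\":\"%s\",\"region\":\"%s\"}",
        bucket, key, iamRoleArn, region);
    HttpRequest request = HttpRequest.newBuilder(URI.create(loaderEndpoint))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();
    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() != 200) {
      throw new RuntimeException("Neptune bulk load request failed: " + response.body());
    }
    // A real connector would parse the returned loadId and poll the loader
    // status endpoint until the job finishes, instead of returning immediately.
  }
}

Hiding the S3 and loader calls behind a small interface would also help with Ismaël's testing point: the clients could then be faked in unit tests instead of depending on localstack.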

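For the read side Daniel sketches (a single-element Create followed by a transform that runs the query), something along these lines could work, written against the TinkerPop gremlin-driver that backs the Gremlin endpoint linked at the top of the thread. Again, this is only a sketch under assumptions: the class and parameter names are made up, and a real connector would need Neptune-specific serializer and auth configuration, richer result handling than plain strings, and some way to split the work for parallel reads.

import java.util.List;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.Result;

/** Hypothetical read step: runs one Gremlin query and emits each result as a String. */
class GremlinQueryFn extends DoFn<String, String> {

  private final String contactPoint; // cluster endpoint (Neptune or any TinkerPop-enabled server)
  private final int port;

  private transient Cluster cluster;
  private transient Client client;

  GremlinQueryFn(String contactPoint, int port) {
    this.contactPoint = contactPoint;
    this.port = port;
  }

  @Setup
  public void setup() {
    // SSL, credentials, and serializer settings for the target cluster would go here.
    cluster = Cluster.build(contactPoint).port(port).enableSsl(true).create();
    client = cluster.connect();
  }

  @ProcessElement
  public void processElement(@Element String gremlinQuery, OutputReceiver<String> out) throws Exception {
    // The element itself carries the query, per the single-element Create idea.
    List<Result> results = client.submit(gremlinQuery).all().get();
    for (Result r : results) {
      out.output(r.getString());
    }
  }

  @Teardown
  public void teardown() {
    if (client != null) client.close();
    if (cluster != null) cluster.close();
  }
}

// Usage sketch: the query string is the single element, the DoFn fans out the results.
// PCollection<String> names =
//     pipeline
//         .apply(Create.of("g.V().hasLabel('person').values('name')"))
//         .apply(ParDo.of(new GremlinQueryFn("<cluster-endpoint>", 8182)));

Because the DoFn only talks to the generic TinkerPop driver and not to anything Neptune-specific, this is also roughly what an engine-agnostic "TinkerPopIO" could start from.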