Re: [I] Proposal to implement ADBC driver for Apache Cassandra [arrow-adbc]

via GitHub Sat, 26 Oct 2024 17:11:45 -0700


SChakravorti21 commented on issue #2245:
URL: https://github.com/apache/arrow-adbc/issues/2245#issuecomment-2439775399


   @paleolimbot @lidavidm Thank you so much for the detailed guidance and 
insights! My apologies for not being able to respond sooner. I was pretty busy 
for the past couple of weeks but my schedule has freed up a bit now :)
   
   > I'm hoping to finish it this week, but there's a work-in-progress tutorial 
of how to get started building a driver in C++ using nanoarrow/the framework 
here!
   
   Perfect timing! I'll be sure to check it out and share any feedback on the 
tutorial.
   
   > Arrow C++ presents a packaging problem (e.g., difficult/impossible to make 
an R driver wrapper, Python wrapper would require pinning a version of pyarrow 
until we sort out how to put two different Arrow C++ versions in the same 
process), which is why Matt probably recommended nanoarrow.
   
   All of that makes sense, thanks for clarifying. I think we've had trouble 
using Arrow C++ in Python extensions for similar reasons, so I'm totally 
on board with using nanoarrow.
   
   > It's a bit subjective, but all our existing drivers lean on the most 
arrowish SDK available for the driver... I have no idea what Cassandra 
provides, but if it had a fairly complete Go or Rust client already and nothing 
for C++ that might be a good reason to implement it in those languages.
   
   These are good points! From some quick searching around, these are the main 
drivers I found in each language:
   
   - C++: [DataStax C/C++ Driver](https://github.com/datastax/cpp-driver). 
Although it doesn't seem to have much active development, we've been using it 
for a while in production and it is generally performant and reliable.
   
   - Go: [Cassandra GoCQL 
Driver](https://github.com/apache/cassandra-gocql-driver). Looks like this 
initially started as an independent project and was recently (as of this year) 
donated to Cassandra. I've noticed that ScyllaDB is maintaining their own 
[fork](https://github.com/scylladb/gocql) of this driver, which appears to be 
much more actively maintained.
   
   - Rust: the major ones I'm aware of are 
[cassandra-rs](https://github.com/cassandra-rs/cassandra-rs), 
[cdrs-tokio](https://github.com/krojew/cdrs-tokio), and [ScyllaDB's Rust 
driver](https://github.com/scylladb/scylla-rust-driver). My understanding 
(based on what I've learned from my colleagues) is that Scylla's driver is the 
most performant/reliable right now.
   
   A couple of other random observations/thoughts:
   
   - Cassandra's [list of documented 
drivers](https://cassandra.apache.org/doc/4.0.3/cassandra/getting_started/drivers.html)
 seems to suggest that DataStax owns/maintains quite a lot of the major drivers.
   
   - If the hope is to eventually donate this ADBC driver to the Cassandra 
project, maybe using one of the drivers they maintain (like GoCQL) would make 
it a less controversial suggestion.
   
   The choice does seem like a bit of a toss-up. Like you said, I could be of 
more help if we pursued this in C++, so I'm still leaning towards that approach.
   
   To address David's point about dependency management, the C/C++ driver 
[lists the following 
dependencies](https://docs.datastax.com/en/developer/cpp-driver/2.16/topics/building/index.html#dependencies):
   
   > - [CMake](http://www.cmake.org/download) v2.6.4+
   > - [libuv](http://libuv.org/) 1.x
   > - Kerberos v5 ([Heimdal](https://github.com/heimdal/heimdal) or 
[MIT](http://web.mit.edu/kerberos)) *
   > - [OpenSSL](https://www.openssl.org/) v1.0.x or v1.1.x **
   > - [zlib](https://www.zlib.net/) v1.x ***
   
   Certainly no gRPC, but please let me know if any of these seem problematic.
   
   > We have some `docker compose` services for databases for this purposes. 
You could do a PR first that makes it so that we can do `docker compose up 
apache-cassandra-test`.
   
   Sounds good to me! It should be doable, since Cassandra publishes official 
Docker images: https://hub.docker.com/_/cassandra.
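   
   As a rough starting point, the service could look something like the 
sketch below. The service name comes from the suggestion above; the image tag, 
the port mapping (9042 is Cassandra's default CQL port), and the `cqlsh`-based 
healthcheck are my assumptions and would need to be adapted to the existing 
compose file in this repo.
   
   ```yaml
   services:
     # Service name taken from the suggested `docker compose up` command.
     apache-cassandra-test:
       # Assumption: official Docker Hub image; pin a specific tag in practice.
       image: cassandra:latest
       ports:
         # Default CQL native transport port.
         - "9042:9042"
       healthcheck:
         # Assumption: treat the node as ready once cqlsh can run a query.
         test: ["CMD-SHELL", "cqlsh -e 'DESCRIBE KEYSPACES'"]
         interval: 10s
         timeout: 10s
         retries: 12
   ```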
   
   > The Postgres driver has an example of writing tests for this without a 
live connection to the database (the "copy" tests). No pressure to do it 
exactly like that but I found it useful to accelerate the process of adding 
full type support there.
   
   I've been wondering how to approach this as well, so I'll definitely take a 
look at that. Appreciate the pointer.
   
   > Where to put it is a good thing to think about... ideally we'd (maybe just 
speaking for me here) like for ADBC connectors to live with the project instead 
of with us to spread out the maintenance load...
   
   I agree, it would be nice if the ADBC driver lived under the Cassandra 
project. This was also discussed in the most recent Arrow community call. The 
main hurdle seems to be that Cassandra is generally used for more OLTP-style 
workloads, so it may not be clear to them how Arrow fits into the picture or 
why an ADBC driver is necessary. I think starting in this repo to prove out the 
idea and then approaching the Cassandra community would be a viable strategy.
   
   > Feel free to ping me early and often as you get started (probably 
everybody else is game too, but I'll let the volunteer themselves 🙂).
   
   Thank you! I'll definitely reach out as necessary and plan to use draft PRs 
to get early feedback (if that's ok).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
