SChakravorti21 commented on issue #2245: URL: https://github.com/apache/arrow-adbc/issues/2245#issuecomment-2439775399
@paleolimbot @lidavidm Thank you so much for the detailed guidance and insights! My apologies for not being able to respond sooner. I was pretty busy for the past couple of weeks but my schedule has freed up a bit now :)

> I'm hoping to finish it this week, but there's a work-in-progress tutorial of how to get started building a driver in C++ using nanoarrow/the framework here!

Perfect timing! I'll be sure to check it out and share any feedback on the tutorial.

> Arrow C++ presents a packaging problem (e.g., difficult/impossible to make an R driver wrapper, Python wrapper would require pinning a version of pyarrow until we sort out how to put two different Arrow C++ versions in the same process), which is why Matt probably recommended nanoarrow.

All of that makes sense, thanks for clarifying. I think we've had trouble using Arrow C++ in Python extensions for similar reasons, so I'm totally onboard with using nanoarrow.

> It's a bit subjective, but all our existing drivers lean on the most arrowish SDK available for the driver... I have no idea what Cassandra provides, but if it had a fairly complete Go or Rust client already and nothing for C++ that might be a good reason to implement it in those languages.

These are good points! From some quick searching around, these are the main drivers I found in each language:

- C++: [DataStax C/C++ Driver](https://github.com/datastax/cpp-driver). Although it doesn't seem to have much active development, we've been using it for a while in production and it is generally performant and reliable.
- Go: [Cassandra GoCQL Driver](https://github.com/apache/cassandra-gocql-driver). Looks like this initially started as an independent project and was recently (as of this year) donated to Cassandra. I've noticed that ScyllaDB is maintaining their own [fork](https://github.com/scylladb/gocql) of this driver, which appears to be much more actively maintained.
- Rust: the major ones I'm aware of are [cassandra-rs](https://github.com/cassandra-rs/cassandra-rs), [cdrs-tokio](https://github.com/krojew/cdrs-tokio), and [ScyllaDB's Rust driver](https://github.com/scylladb/scylla-rust-driver). My understanding (based on what I've learned from my colleagues) is that Scylla's driver is the most performant/reliable right now.

A couple of other random observations/thoughts:

- Cassandra's [list of documented drivers](https://cassandra.apache.org/doc/4.0.3/cassandra/getting_started/drivers.html) seems to suggest that DataStax owns/maintains quite a lot of the major drivers.
- If the hope is to eventually donate this ADBC driver to the Cassandra project, maybe using one of the drivers they maintain (like GoCQL) would make it a less controversial suggestion.

The choice does seem like a bit of a toss-up. Like you said, I could be of more help if we pursued this in C++, so I'm still leaning towards that approach. To address David's point about dependency management, the C/C++ driver [lists the following dependencies](https://docs.datastax.com/en/developer/cpp-driver/2.16/topics/building/index.html#dependencies):

> - [CMake](http://www.cmake.org/download) v2.6.4+
> - [libuv](http://libuv.org/) 1.x
> - Kerberos v5 ([Heimdal](https://github.com/heimdal/heimdal) or [MIT](http://web.mit.edu/kerberos)) *
> - [OpenSSL](https://www.openssl.org/) v1.0.x or v1.1.x **
> - [zlib](https://www.zlib.net/) v1.x ***

Certainly no gRPC, but please let me know if any of these seem problematic.

> We have some `docker compose` services for databases for this purposes. You could do a PR first that makes it so that we can do `docker compose up apache-cassandra-test`.

Sounds good to me! It should be doable, Cassandra does publish official Docker images: https://hub.docker.com/_/cassandra.

> The Postgres driver has an example of writing tests for this without a live connection to the database (the "copy" tests).
> No pressure to do it exactly like that but I found it useful to accelerate the process of adding full type support there.

I've been wondering how to approach this as well, so I'll definitely take a look at that. Appreciate the pointer.

> Where to put it is a good thing to think about... ideally we'd (maybe just speaking for me here) like for ADBC connectors to live with the project instead of with us to spread out the maintenance load...

I agree, it would be nice if the ADBC driver lived under the Cassandra project. This was also discussed in the most recent Arrow community call. The main hurdle seems to be that Cassandra is generally used for more OLTP-style workloads, so it may not be clear to them how Arrow fits into the picture or why an ADBC driver is necessary. I think starting in this repo to prove out the idea and then approaching the Cassandra community would be a viable strategy.

> Feel free to ping me early and often as you get started (probably everybody else is game too, but I'll let the volunteer themselves 🙂).

Thank you! I'll definitely reach out as necessary and plan to use draft PRs to get early feedback (if that's ok).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
