Hi Charles,
Very cool!
If you hold onto the Cassandra connection as a static on your storage plugin,
you can use it easily enough, but there is no lifecycle management. We can
probably invent some mechanism to ensure that the Cassandra connection is
properly closed after some time and at Drillbit exit. Seems like a good general
mechanism.
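A minimal sketch of that kind of lifecycle mechanism: hold the session in a lazily initialized static and register a shutdown hook so it is closed at Drillbit exit. The class names here are placeholders, not the actual Cassandra driver or Drill API.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: CachedSession stands in for a real, heavyweight,
// thread-safe client session; these are placeholder names.
public class SessionHolder {
    public static class CachedSession {
        private volatile boolean open = true;
        public boolean isOpen() { return open; }
        public void close() { open = false; }
    }

    private static final AtomicReference<CachedSession> SESSION = new AtomicReference<>();

    // Lazily create the shared session; register a shutdown hook so it
    // is closed when the process (here, the Drillbit) exits.
    public static CachedSession get() {
        CachedSession s = SESSION.get();
        if (s == null) {
            CachedSession created = new CachedSession();
            if (SESSION.compareAndSet(null, created)) {
                Runtime.getRuntime().addShutdownHook(new Thread(created::close));
                return created;
            }
            created.close();      // lost the init race; discard our copy
            return SESSION.get();
        }
        return s;
    }
}
```

A time-based expiry could be layered on top of the same holder; the shutdown hook covers the Drillbit-exit half of the problem.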
I agree that using the filter pushdown stuff from the Base plugin will be far
easier. Just code up your logic rather than having to make yet another copy of
the big wad of code that tries to parse Drill expressions.
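To illustrate the kind of plugin-local logic that remains once the shared framework handles expression parsing, here is a hypothetical sketch: accept only simple column-versus-literal equality predicates and render them as CQL-style text. The Base framework's real hooks may look quite different.

```java
// Hypothetical sketch: the Base framework's actual pushdown interface
// may differ. Shows the per-plugin decision ("can I push this?") and
// translation ("what text do I generate?") steps.
public class CqlFilterBuilder {
    public static class Predicate {
        final String column; final String op; final Object value;
        public Predicate(String column, String op, Object value) {
            this.column = column; this.op = op; this.value = value;
        }
    }

    // In this sketch, only equality on a plain column is pushable.
    public static boolean canPush(Predicate p) {
        return "=".equals(p.op);
    }

    // Render the predicate as CQL-style text, quoting string literals.
    public static String toCql(Predicate p) {
        Object v = p.value instanceof String ? "'" + p.value + "'" : p.value;
        return p.column + " " + p.op + " " + v;
    }
}
```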
For data type conversion we have code in EVF, used by the Text reader, to
convert from Strings to other types. To do conversions, you have to know the
destination type. Does Cassandra provide some schema you can work against? In
the Text reader, we use the "provided schema" which Arina implemented.
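The shape of that destination-driven conversion, in miniature: the target type comes from the (provided) schema, and the converter dispatches on it. This is a simplified stand-in, not EVF's actual converter classes.

```java
// Sketch of destination-driven String conversion, along the lines of
// what the text reader's converters do. DrillType is a stand-in enum,
// not Drill's actual type system.
public class StringConverter {
    public enum DrillType { INT, BIGINT, FLOAT8, VARCHAR, BIT }

    // The destination type must be known (e.g., from a provided schema)
    // before any conversion can happen.
    public static Object convert(String value, DrillType target) {
        switch (target) {
            case INT:     return Integer.parseInt(value);
            case BIGINT:  return Long.parseLong(value);
            case FLOAT8:  return Double.parseDouble(value);
            case BIT:     return Boolean.parseBoolean(value);
            case VARCHAR:
            default:      return value;
        }
    }
}
```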
After seeing how you and Arina made good use of "shim" classes to handle column
conversions, I think that is probably the better long-term solution instead of
the EVF column converters. By using shims, any quirky conversion code can be in
the plugin. I'll be interested to see what you do for Cassandra.
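The shim idea, sketched: one small converter object per column, chosen once at setup time, so quirky per-source conversions live in the plugin rather than in EVF. The interface and names below are illustrative, not Drill's actual classes.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative "shim" pattern: one converter per column, bound once.
public class ColumnShims {
    public interface ColumnShim { Object write(Object raw); }

    // Example quirk: values the source hands back as objects that the
    // plugin wants to store as their string form.
    public static final ColumnShim TO_STRING = raw -> raw.toString();
    public static final ColumnShim IDENTITY  = raw -> raw;

    // Pick a shim per column once, then reuse it for every row.
    public static Map<String, ColumnShim> bind(Map<String, String> schema) {
        Map<String, ColumnShim> shims = new HashMap<>();
        schema.forEach((col, type) ->
            shims.put(col, "uuid".equals(type) ? TO_STRING : IDENTITY));
        return shims;
    }
}
```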
On the other push-downs, it would be great if you could identify common
expression-tree code we can wrap into something in the "Base" framework. I'd
guess that all plugins have to do more-or-less the same Calcite stuff to handle
Limit and Aggregates. Limit may be easier; aggregates require the same kind of
analysis as filter push-downs. For example, you can probably push down
COUNT(*), SUM(foo) and so on, but not an aggregate of a Drill function.
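That analysis step can be sketched with a tiny stand-in expression tree: an aggregate is pushable only when its argument is a plain column reference (or * for COUNT), not a Drill function the external system can't evaluate. The classes below are illustrative, not Calcite's or Drill's.

```java
// Sketch of aggregate push-down analysis over a toy expression tree.
public class AggPushdown {
    public interface Expr {}
    public static class ColumnRef implements Expr {
        final String name;
        public ColumnRef(String name) { this.name = name; }
    }
    public static class FunctionCall implements Expr {
        final String fn; final Expr arg;
        public FunctionCall(String fn, Expr arg) { this.fn = fn; this.arg = arg; }
    }
    public static class Star implements Expr {}

    // Pushable: COUNT(*) and aggregates of a bare column. Not pushable:
    // an aggregate whose argument is itself a function call.
    public static boolean canPush(String agg, Expr arg) {
        if ("COUNT".equals(agg) && arg instanceof Star) return true;  // COUNT(*)
        return arg instanceof ColumnRef;                              // SUM(foo), ...
    }
}
```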
Thanks for writing the documentation. In my experience working with Presto,
there is a huge difference in productivity when documentation is available, and
when I have to figure things out by reading (undocumented) code.
Thanks,
- Paul
On Sunday, January 19, 2020, 3:26:57 PM PST, Charles Givre
<[email protected]> wrote:
Hey Paul,
I was messing with the Cassandra plugin last night and moved the connection
logic to the actual StoragePlugin class. This, combined with a few null checks,
seemed to do the trick: queries are now virtually instantaneous!
What remains to be done:
1. Filter pushdown not working: I'm going to wait until the Base Storage PR
is committed and attempt to use that. This plugin seems like a really obvious
candidate for that. I've been digging around to see if I can figure out how to
use the Calcite adapters and there is nothing out there WRT documentation or
example code. I saw that the Drill JDBC storage plugin uses the Calcite
adapter so I may try to follow that model.
2. Fix data types: Right now, the plugin returns everything as a string.
Obviously, that needs to get fixed, so I'll need to rewrite the RecordReader
class to use EVF.
3. Other push downs: This seems like a really good candidate for Limit and
Aggregate push downs as well. If I can figure out how to do that and/or use
the Calcite adapter to do so, I'll work on that.
4. Write documentation and additional configuration options:
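On item 3, the Limit case really is the easy one: once the planner detects a limit, the plugin just appends it to the generated query text. A hedged sketch, with illustrative names rather than the plugin's real query builder:

```java
// Sketch of limit push-down: append LIMIT to the generated query.
// A negative value means no limit was found in the plan.
public class LimitPushdown {
    public static String applyLimit(String baseQuery, int limit) {
        if (limit < 0) {
            return baseQuery;               // no limit to push
        }
        return baseQuery + " LIMIT " + limit;
    }
}
```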
If we can get the Base Storage PR committed, my goal is to get this ready for
Drill 1.18. This may be a bit of a stretch, but we'll see. If anyone is
interested, here is a link to my branch[1]. Feedback is definitely
appreciated, but in no way is this ready for code review.
Best,
-- C
[1]: https://github.com/cgivre/drill/tree/storage-cassandra
> On Jan 17, 2020, at 5:37 PM, Paul Rogers <[email protected]> wrote:
>
> Hi Charles,
>
> Poked around a bit. Turns out that the Cassandra client seems to work a bit
> differently than a JDBC client. From the JavaDoc page: "Session instances are
> thread-safe and usually a single instance is enough per application." Given
> this, you might be able to cache a single connection (per keyspace) to be
> shared by the planner and scans. [1]
>
> Still need some global object to open, maintain and close the connection, so
> something would have to be added to Drill to support this.
>
> JDBC is harder to work with because connection access must be serialized:
> only one thread can use the connection at a time. More to the point,
> transactions must be serialized; JDBC can't support two separate transactions
> on a single connection.
>
>
> Thanks,
> - Paul
>
>
> [1]
> https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/Session.html
>
>
>
>
> On Friday, January 17, 2020, 04:56:39 AM PST, Charles Givre
><[email protected]> wrote:
>
> Hello Drill Devs
> I have a question for you. I'm working on a storage plugin for Apache
> Cassandra. I've got the queries mostly working, but I have a question.
> Connections to Cassandra are meant to be opened once and remain open, so
> they are "heavy". It takes about 2 seconds to connect to the Cassandra
> instance on my local machine. Once the connection happens, the queries are
> very fast. I'm wondering: is there a way to open the connection once and have
> it persist somehow so that we don't have that overhead for each query?
>
> I seem to recall a similar discussion for the JDBC storage plugin.
> Thanks,
> -- C