Hi Charles,

Very cool!

If you hold onto the Cassandra connection as a static on your storage plugin, 
you can use it easily enough, but there is no lifecycle management. We can 
probably invent some mechanism to ensure that the Cassandra connection is 
properly closed after some time and at Drillbit exit. Seems like a good general 
mechanism.
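
A rough sketch of what such a mechanism might look like, offered only as a starting point. SharedConnection is a hypothetical name, not an existing Drill class; it wraps any AutoCloseable so it does not depend on the Cassandra driver, and a JVM shutdown hook stands in for "Drillbit exit". Closing after an idle period is not covered here.

```java
import java.util.function.Supplier;

/**
 * Hypothetical holder (not an existing Drill class) for one expensive,
 * thread-safe client object such as a Cassandra Session.
 */
final class SharedConnection<T extends AutoCloseable> {
  private final Supplier<T> factory; // opens the connection on first use
  private T connection;              // null until get(), and again after close()

  SharedConnection(Supplier<T> factory) {
    this.factory = factory;
  }

  /** Returns the cached connection, opening it if necessary. */
  synchronized T get() {
    if (connection == null) {
      connection = factory.get();
    }
    return connection;
  }

  /** Closes the connection if it is open; a later get() reopens it. */
  synchronized void close() {
    if (connection == null) {
      return;
    }
    try {
      connection.close();
    } catch (Exception e) {
      throw new IllegalStateException("Failed to close shared connection", e);
    } finally {
      connection = null;
    }
  }

  /** Assumption: a JVM shutdown hook stands in for "Drillbit exit". */
  void closeOnJvmExit() {
    Runtime.getRuntime().addShutdownHook(new Thread(this::close));
  }
}
```

The plugin would keep one such holder as its static, call get() from both the planner and the scans, and register closeOnJvmExit() once when the holder is created.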

I agree that using the filter pushdown stuff from the Base plugin will be far 
easier. Just code up your logic rather than having to make yet another copy of 
the big wad of code that tries to parse Drill expressions.

For data type conversion we have code in EVF, used by the Text reader, to 
convert from Strings to other types. To do conversions, you have to know the 
destination type. Does Cassandra provide some schema you can work against? In 
the Text reader, we use the "provided schema" which Arina implemented.
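
To illustrate why the destination type matters, here is a minimal sketch of a string converter keyed by target type. This is not the EVF code; TargetType and StringConverters are hypothetical names, and the type list is only a small sample.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical target types; Drill and Cassandra each define their own. */
enum TargetType { INT, BIGINT, DOUBLE, BOOLEAN, VARCHAR }

/** Hypothetical helper: picks a String parser based on the destination type. */
final class StringConverters {
  private static final Map<TargetType, Function<String, Object>> CONVERTERS =
      new EnumMap<>(TargetType.class);

  static {
    CONVERTERS.put(TargetType.INT, s -> Integer.valueOf(s.trim()));
    CONVERTERS.put(TargetType.BIGINT, s -> Long.valueOf(s.trim()));
    CONVERTERS.put(TargetType.DOUBLE, s -> Double.valueOf(s.trim()));
    CONVERTERS.put(TargetType.BOOLEAN, s -> Boolean.valueOf(s.trim()));
    CONVERTERS.put(TargetType.VARCHAR, s -> s); // strings pass through unchanged
  }

  private StringConverters() { }

  /** Converts value to the given type; null stays null. */
  static Object convert(String value, TargetType type) {
    if (value == null) {
      return null;
    }
    return CONVERTERS.get(type).apply(value);
  }
}
```

Without a schema to supply the TargetType for each column, the reader has nothing to select a converter with, which is why everything comes back as a string.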


After seeing how you and Arina made good use of "shim" classes to handle column 
conversions, I think that is probably the better long-term solution instead of 
the EVF column converters. By using shims, any quirky conversion code can be in 
the plugin. I'll be interested to see what you do for Cassandra.

On the other push-downs, it would be great if you could identify common 
expression-tree code we can wrap into something in the "Base" framework. I'd 
guess that all plugins have to do more-or-less the same Calcite stuff to handle 
Limit and Aggregates. Limit may be easier; aggregates require the same kind of 
analysis as filter push-downs. For example, you can probably push-down 
COUNT(*), SUM(foo) and so on, but not an aggregate of a Drill function.
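
A sketch of the kind of check I have in mind, using a toy expression tree rather than Calcite's classes. All names here (Expr, Star, ColumnRef, FunctionCall, AggregatePushDown) are hypothetical, and the list of supported aggregates is something each plugin would have to supply.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy expression tree; a stand-in for the planner's real expression classes. */
interface Expr { }

final class Star implements Expr { }

final class ColumnRef implements Expr {
  final String name;
  ColumnRef(String name) { this.name = name; }
}

final class FunctionCall implements Expr {
  final String name;
  final List<Expr> args;
  FunctionCall(String name, Expr... args) {
    this.name = name;
    this.args = Arrays.asList(args);
  }
}

/** Hypothetical rule: which aggregate calls can the data source evaluate? */
final class AggregatePushDown {
  private final Set<String> supported = new HashSet<>();

  AggregatePushDown(String... supportedAggregates) {
    for (String name : supportedAggregates) {
      supported.add(name.toUpperCase(Locale.ROOT));
    }
  }

  /** True only for a supported aggregate over "*" or plain columns. */
  boolean canPushDown(FunctionCall agg) {
    if (!supported.contains(agg.name.toUpperCase(Locale.ROOT))) {
      return false;
    }
    for (Expr arg : agg.args) {
      // A nested function call would need Drill to evaluate it first.
      if (!(arg instanceof Star || arg instanceof ColumnRef)) {
        return false;
      }
    }
    return true;
  }
}
```

With new AggregatePushDown("COUNT", "SUM"), COUNT(*) and SUM(foo) pass the check, while SUM over a nested function call stays in Drill.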

Thanks for writing the documentation. In my experience working with Presto, 
there is a huge difference in productivity when documentation is available, and 
when I have to figure things out by reading (undocumented) code.


Thanks,
- Paul

 

    On Sunday, January 19, 2020, 3:26:57 PM PST, Charles Givre 
<[email protected]> wrote:  
 
 Hey Paul, 
I was messing with the Cassandra plugin last night and moved the connection 
logic to the actual StoragePlugin class.  This, combined with a few null checks,
seemed to do the trick: queries are now virtually instantaneous!

What remains to be done:
1.  Filter pushdown not working:  I'm going to wait until the Base Storage PR 
is committed and attempt to use that.  This plugin seems like a really obvious 
candidate for that.  I've been digging around to see if I can figure out how to 
use the Calcite adapters and there is nothing out there WRT documentation or 
example code.  I saw that the Drill JDBC storage plugin uses the Calcite 
adapter so I may try to follow that model.

2.  Fix data types:  Right now, the plugin returns everything as a string.  
Obviously, that needs to get fixed, so I'll need to rewrite the RecordReader 
class to use EVF.

3.  Other push downs:  This seems like a really good candidate for Limit and 
Aggregate push downs as well.  If I can figure out how to do that and/or use 
the Calcite adapter to do so, I'll work on that.  
4.  Write documentation and additional configuration options.

If we can get the Base Storage PR committed, my goal is to get this ready for 
Drill 1.18. This may be a bit of a stretch, but we'll see.  If anyone is 
interested, here is a link to my branch[1].  Feedback is definitely 
appreciated, but in no way is this ready for code review.
Best,
-- C


[1]: https://github.com/cgivre/drill/tree/storage-cassandra



> On Jan 17, 2020, at 5:37 PM, Paul Rogers <[email protected]> wrote:
> 
> Hi Charles,
> 
> Poked around a bit. Turns out that the Cassandra client seems to work a bit 
> differently than a JDBC client. From the JavaDoc page: "Session instances are 
> thread-safe and usually a single instance is enough per application." Given 
> this, you might be able to cache a single connection (per keyspace) to be 
> shared by the planner and scans. [1]
> 
> Still need some global object to open, maintain and close the connection, so 
> something would have to be added to Drill to support this.
> 
> JDBC is harder to work with because connection access must be serialized: 
> only one thread can use the connection at a time. More to the point, 
> transactions must be serialized; JDBC can't support two separate transactions 
> on a single JDBC connection.
> 
> 
> Thanks,
> - Paul
> 
> 
> [1] 
> https://docs.datastax.com/en/drivers/java/2.0/com/datastax/driver/core/Session.html
> 
> 
> 
> 
>    On Friday, January 17, 2020, 04:56:39 AM PST, Charles Givre 
><[email protected]> wrote:  
> 
> Hello Drill Devs
> I have a question for you.  I'm working on a storage plugin for Apache 
> Cassandra.  I've got the queries mostly working, but I have a question.  
> Connections to Cassandra are meant to be opened once and remain open and so 
> they are "heavy".  It takes about 2 seconds to connect to the Cassandra 
> instance on my local machine.  Once the connection happens, the queries are 
> very fast.  Is there a way to open the connection once and have 
> it persist somehow so that we don't have that overhead for each query?
> 
> I seem to recall a similar discussion for the JDBC storage plugin.
> Thanks,
> -- C
  
