[ https://issues.apache.org/jira/browse/DRILL-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335238#comment-14335238 ]

Robert Stupp commented on DRILL-92:
-----------------------------------

I went over your patch just to see how the actual C* integration has been 
implemented. Tbh - I don’t know how Drill works - but I know how C* and the 
Java Driver work.

Please let me explain some things in advance. A Cassandra cluster consists of 
many nodes, and some of them may be down without affecting the integrity of the 
whole cluster. Because hosts can be down, the Datastax Java Driver allows you to 
specify *multiple* initial contact points - and multiple initial contact points 
(maybe 3 per data center) should be passed to {{Cluster.Builder}}. All 
connections to a C* cluster are managed by the {{Cluster}} instance - not 
directly by {{Session}}. That means: to effectively close connections to a 
cluster, you have to close the {{Cluster}} instance. Further, the {{Cluster}} 
instance learns about all other nodes in the C* cluster - i.e. it knows all 
nodes in the cluster and which token ranges they serve, and it makes a 
best-effort attempt to route DML statements (SELECT/INSERT/UPDATE/DELETE) 
directly to the nodes that hold replicas of the affected data. A typical 
application does not care about where data actually lives - the Java Driver 
handles that for you.
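
Roughly like this - a minimal sketch, not Drill code; the host names and 
keyspace below are placeholders:

{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ClusterLifecycleSketch {
    public static void main(String[] args) {
        // several initial contact points (a few per data center); placeholder addresses
        Cluster cluster = Cluster.builder()
                .addContactPoints("10.0.0.1", "10.0.0.2", "10.0.0.3")
                .build();
        try {
            Session session = cluster.connect("my_keyspace"); // placeholder keyspace
            // ... run queries through the session; the driver routes them to suitable replicas
        } finally {
            // closing the Cluster (not just the Session) releases all connections
            cluster.close();
        }
    }
}
{code}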

The lines in {{CassandraGroupScan}} calling 
{{com.datastax.driver.core.Metadata#getReplicas}} are wrong. Which nodes are 
replicas for a keyspace is defined by the replication strategy and the 
per-keyspace configuration. The method you’re calling determines the hosts for 
a specific _partition key_ - but you’re passing in the class name of the 
partitioner. Those are completely different things.
Although not completely wrong, I’d encourage you not to assume which nodes hold 
the tokens you intend to request (in {{CassandraUtil}}). Several other things 
influence where data "lives" - e.g. datacenter and rack awareness.
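
For comparison, the 2.x driver’s {{getReplicas}} expects a keyspace name plus 
the *serialized partition key* of one concrete row - the keyspace, contact 
point and key bytes below are just placeholders:

{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;

public class ReplicaLookupSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try {
            // serialized partition key of one concrete row (here: a single text column)
            ByteBuffer partitionKey =
                    ByteBuffer.wrap("some-row-key".getBytes(StandardCharsets.UTF_8));
            // answers "which hosts hold replicas of THIS row",
            // not "which hosts serve this partitioner"
            Set<Host> replicas = cluster.getMetadata().getReplicas("my_keyspace", partitionKey);
            System.out.println(replicas);
        } finally {
            cluster.close();
        }
    }
}
{code}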

{{CassandraSchemaFactory}} contains a keyspace cache and a table cache. That’s 
completely superfluous, since the Java Driver already holds that information in 
the {{Cluster}} instance, and it gets automagically updated when the cluster 
topology and/or the schema changes. That kind of metadata is essential for the 
Java Driver to work and is always present.
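
E.g. the same information is already there - sketch only, contact point is a 
placeholder:

{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.KeyspaceMetadata;
import com.datastax.driver.core.TableMetadata;

public class SchemaMetadataSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try {
            // the driver keeps this up to date on schema/topology changes - no extra cache needed
            for (KeyspaceMetadata ks : cluster.getMetadata().getKeyspaces()) {
                for (TableMetadata table : ks.getTables()) {
                    System.out.println(ks.getName() + "." + table.getName());
                }
            }
        } finally {
            cluster.close();
        }
    }
}
{code}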

I’d recommend starting with a different approach and considering the current 
patch a _proof-of-concept_ (you may of course carry over working code):
# Learn a bit more about C* and the Java Driver architecture ;)
# Forget about accessing the "nearest" node in an initial attempt - you can add 
that later anyway. BTW that only makes sense if you have Drill slaves (I don’t 
know whether such exist) running on each C* node.
# Start with a simple cluster to work against. Take a look at _ccm_ - it’s a 
neat tool that spawns a C* cluster with multiple nodes on your local machine: 
https://github.com/pcmanus/ccm/.
# Once you have a basic implementation running, you may improve it by adding 
datacenter awareness to your client (it’s basically just a simple configuration 
using {{Cluster.Builder}} - see the sketch after this list), authentication 
against the C* cluster, and some other fine tuning.
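
A sketch of what I mean by "just a simple configuration" - this assumes the 2.x 
driver, and the DC name, hosts and credentials are placeholders:

{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TunedClusterSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoints("10.0.0.1", "10.0.0.2")
                // prefer nodes in the local data center, and route to replicas when possible
                .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))
                // authentication against the cluster (PasswordAuthenticator on the C* side)
                .withCredentials("drill_user", "secret")
                .build();
        // ... use cluster.connect(...) as usual, then cluster.close() when done
    }
}
{code}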

Feel free to ask questions on the C* user mailing list ([email protected]) or 
on freenode IRC in #cassandra. There are many people happy to answer individual 
questions. Just ask - don’t ask to ask :)


> Cassandra storage engine
> ------------------------
>
>                 Key: DRILL-92
>                 URL: https://issues.apache.org/jira/browse/DRILL-92
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Steven Phillips
>            Assignee: Yash Sharma
>             Fix For: Future
>
>         Attachments: DRILL-92.patch, DRILL-CASSANDRA.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
