[
https://issues.apache.org/jira/browse/DRILL-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335238#comment-14335238
]
Robert Stupp commented on DRILL-92:
-----------------------------------
I went over your patch just to see how the actual C* integration has been
implemented. Tbh - I don’t know how Drill works - but I know how C* and the
Java Driver work.
Please let me explain some things in advance. A Cassandra cluster consists of
many nodes, and some of them might be down without affecting the integrity of
the whole cluster. Because some hosts might be down, the DataStax Java Driver
lets you specify *multiple* initial contact points - and multiple initial
contact points (maybe 3 per data center) should be passed to
{{Cluster.Builder}}. All connections to a C* cluster are managed by the
{{Cluster}} instance - not directly by {{Session}}. That means: to effectively
close connections to a cluster, you have to close the {{Cluster}} instance.
Further, the {{Cluster}} instance learns about all other nodes in the C*
cluster - i.e. it knows all nodes in the cluster and which token ranges they
serve, and it makes a best-effort attempt to route DML statements
(SELECT/INSERT/UPDATE/DELETE) to the nodes that hold replicas of the affected
data. A typical application does not care about where data actually lives -
the Java Driver handles that for you.
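For illustration, a minimal sketch of the usual driver setup with the 2.x API
(contact point host names, keyspace and table names below are made up):
{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class DriverSketch {
    public static void main(String[] args) {
        // Several initial contact points - the driver discovers the rest of the cluster itself.
        Cluster cluster = Cluster.builder()
                .addContactPoints("cass-node1", "cass-node2", "cass-node3")
                .build();
        Session session = cluster.connect("my_keyspace");
        ResultSet rs = session.execute("SELECT * FROM my_table LIMIT 10");
        for (Row row : rs) {
            System.out.println(row);
        }
        // Closing the Cluster (not just the Session) releases all connections.
        cluster.close();
    }
}
{code}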
The lines in {{CassandraGroupScan}} calling
{{com.datastax.driver.core.Metadata#getReplicas}} are wrong. Which nodes are
replicas for a keyspace is defined by the replication strategy and the
per-keyspace configuration. The method you’re calling determines the hosts for
a specific _partition key_ - but you’re passing in the class name of the
partitioner. Those are completely different things.
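For reference, a hedged sketch of what {{Metadata#getReplicas}} actually
expects: a keyspace name plus the serialized partition key, not a partitioner
class name (keyspace and key value below are made up):
{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;

import java.nio.ByteBuffer;
import java.util.Set;

public class ReplicaLookupSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoints("cass-node1", "cass-node2")
                .build();
        Metadata metadata = cluster.getMetadata();

        // The partition key must be the serialized key value, e.g. a text key "user-42".
        ByteBuffer partitionKey = ByteBuffer.wrap("user-42".getBytes());

        // Returns the hosts that are replicas for this *partition key* in this keyspace,
        // according to the keyspace's replication strategy.
        Set<Host> replicas = metadata.getReplicas("my_keyspace", partitionKey);
        for (Host host : replicas) {
            System.out.println(host.getAddress());
        }
        cluster.close();
    }
}
{code}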
Although not completely wrong, I’d encourage you not to assume which nodes hold
the tokens you intend to request (in {{CassandraUtil}}). There are several
other factors that influence where data "lives" - e.g. datacenter and rack
awareness.
In {{CassandraSchemaFactory}} there is a keyspace cache and a table cache. Both
are completely superfluous: the Java Driver already holds that information in
the {{Cluster}} instance, and it is updated automatically when the cluster
topology and/or the schema changes. That kind of metadata is essential for the
Java Driver to work and is always present.
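A rough sketch of reading schema metadata straight from the driver instead of
caching it yourself (the contact point name is made up):
{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ColumnMetadata;
import com.datastax.driver.core.KeyspaceMetadata;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.TableMetadata;

public class SchemaMetadataSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("cass-node1")
                .build();

        // The driver keeps this metadata up to date as the topology/schema changes.
        Metadata metadata = cluster.getMetadata();
        for (KeyspaceMetadata ks : metadata.getKeyspaces()) {
            System.out.println("keyspace: " + ks.getName());
            for (TableMetadata table : ks.getTables()) {
                System.out.println("  table: " + table.getName());
                for (ColumnMetadata column : table.getColumns()) {
                    System.out.println("    column: " + column.getName() + " " + column.getType());
                }
            }
        }
        cluster.close();
    }
}
{code}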
I’d recommend starting with a different approach and considering the current
patch a _proof-of-concept_ (you may of course take over working code):
# Learn a bit more about C* and the Java Driver architecture ;)
# Forget about accessing the "nearest" node in an initial attempt - you can add
that later anyway. BTW, that only makes sense if you have Drill slaves (I don’t
know if such exist) running on each C* node.
# Start with a simple cluster to work against. Take a look at _ccm_ - it’s a
neat tool that spawns a C* cluster with multiple nodes on your local machine:
https://github.com/pcmanus/ccm/.
# Once you have a basic implementation running, you can improve it by adding
datacenter awareness to your client (it’s basically just a simple configuration
using {{Cluster.Builder}}), authentication against the C* cluster, and some
other fine tuning - see the sketch below.
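A hedged sketch of what that fine tuning could look like with the Java Driver
2.x API (data center name, contact points and credentials are made up):
{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TunedClusterSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoints("cass-node1", "cass-node2", "cass-node3")
                // Prefer nodes in the local data center and route to replicas where possible.
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))
                // Plain username/password authentication against the C* cluster.
                .withCredentials("drill_user", "secret")
                .build();
        // ... use cluster.connect(...) as usual ...
        cluster.close();
    }
}
{code}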
Feel free to ask questions on the C* user mailing list [email protected] or
on freenode IRC #cassandra. There are many people happy to answer individual
questions. Just ask - don’t ask to ask :)
> Cassandra storage engine
> ------------------------
>
> Key: DRILL-92
> URL: https://issues.apache.org/jira/browse/DRILL-92
> Project: Apache Drill
> Issue Type: New Feature
> Reporter: Steven Phillips
> Assignee: Yash Sharma
> Fix For: Future
>
> Attachments: DRILL-92.patch, DRILL-CASSANDRA.patch
>
>