[
https://issues.apache.org/jira/browse/DRILL-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513026#comment-14513026
]
Robert Stupp commented on DRILL-92:
-----------------------------------
[[email protected]] sorry for the late response. Here are my comments:
bq. Tables & Keyspace cache
You don’t need to cache these yourself - the Java driver already caches its
metadata, so a second cache is duplicated effort. Removing it also makes your
code less complex, e.g. all the cache-related {{catch ExecutionException}}
handling goes away.
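For illustration, a minimal sketch of reading the driver’s own metadata
directly instead (the keyspace/table names are placeholders):
{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.KeyspaceMetadata;
import com.datastax.driver.core.TableMetadata;

class MetadataLookupSketch {
  // The Java driver keeps cluster metadata up to date on its own,
  // so there is no need for an additional cache layer on top of it.
  static TableMetadata lookupTable(Cluster cluster, String keyspace, String table) {
    KeyspaceMetadata ks = cluster.getMetadata().getKeyspace(keyspace);
    return ks == null ? null : ks.getTable(table);
  }
}
{code}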
I’m not sure whether the connection cache in {{CassandraConnectionManager}}
really works. For example, if hosts 127.0.0.1 and 127.0.0.2 share the same
{{Cluster}} instance and the cache decides to evict the entry for 127.0.0.1,
the instance that 127.0.0.2 still references no longer works.
Besides that, you query the cache with a {{List<String>}} key but add entries
with a {{String}} key, so lookups can never match what was stored.
As a proposal: use the cluster name as the cache key, both when adding and
when querying.
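A minimal sketch of that proposal, assuming a Guava {{LoadingCache}} - all
class, field and helper names here are hypothetical, not the actual patch:
{code:java}
import com.datastax.driver.core.Cluster;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.RemovalNotification;

class ConnectionCacheSketch {
  // One entry per logical cluster, keyed by a single canonical String
  // (e.g. the cluster name), so lookups and inserts use the same key type.
  private final LoadingCache<String, Cluster> clusters = CacheBuilder.newBuilder()
      .removalListener(new RemovalListener<String, Cluster>() {
        @Override
        public void onRemoval(RemovalNotification<String, Cluster> n) {
          // Safe only because no two keys share a Cluster instance here;
          // whether a long-running operation still uses it remains a concern.
          n.getValue().close();
        }
      })
      .build(new CacheLoader<String, Cluster>() {
        @Override
        public Cluster load(String clusterName) {
          // Hypothetical: resolve contact points for the named cluster
          // from the storage plugin configuration.
          return Cluster.builder()
              .addContactPoints(contactPointsFor(clusterName))
              .build();
        }
      });

  private String[] contactPointsFor(String clusterName) {
    return new String[] {"127.0.0.1", "127.0.0.2"}; // placeholder
  }
}
{code}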
I’m also not sure whether you can always close a {{Cluster}} instance at
eviction time - specifically, whether such an instance might still be in use
by a long-running operation.
Both {{CassandraConnectionManager}} and {{CassandraSchemaFactory}} create
their own {{Cluster}} instances and therefore hold independent resources
(connections, threads, etc.). These should be merged (see the sketch below).
If the two classes are used in completely different contexts, please ignore
this comment.
Also, the {{Cluster}} instance in {{CassandraSchemaFactory}} is never closed,
so its resources leak.
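If they can share one instance, a minimal wiring sketch - the constructor
signatures are assumptions, not what the patch currently has:
{code:java}
import com.datastax.driver.core.Cluster;

// Hypothetical wiring: the plugin owns the single Cluster instance and
// its lifecycle; both collaborators just borrow it.
class CassandraPluginWiringSketch implements AutoCloseable {
  private final Cluster cluster;

  CassandraPluginWiringSketch(String... contactPoints) {
    this.cluster = Cluster.builder().addContactPoints(contactPoints).build();
    // new CassandraConnectionManager(cluster);  // assumed constructor
    // new CassandraSchemaFactory(cluster);      // assumed constructor
  }

  @Override
  public void close() {
    cluster.close(); // releases connections and threads exactly once
  }
}
{code}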
bq. Endpoint affinity & Partition token
If you’re using that code just to assign ranges to Drill hosts, it should be
fine.
But do not assume anything about the tokens assigned to a C* host. Token
assignment depends heavily on the individual cluster configuration
(partitioner, topology, node placement (DC, rack)) and on the keyspace
configuration. It’s not that easy, but it is manageable.
In {{org.apache.drill.exec.store.cassandra.CassandraRecordReader#setup}} you’re
using {{QueryBuilder.token}} for paging/slicing. Unfortunately, that will not
work. Assume the C* cluster uses vnodes (the default is 256 vnodes per node):
vnode tokens are assigned randomly to endpoints (= nodes), not like the
old-fashioned single token per node, so you simply cannot slice the ring with
the {{token()}} function that way. Beyond that, it is quite difficult to split
slices so that they match both C* nodes/vnodes *and* Drill _sub scan ranges_
(is this the correct wording?). Clean slicing across all nodes/vnodes is one
of the weak spots in C* - that’s why Hadoop-on-C* setups recommend disabling
vnodes; they have the same problem. Let me think a bit about that - maybe I
can provide a solution or at least a workaround.
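For reference: newer Java driver versions (2.1.5+, as far as I know) expose
the actual token ranges and their replicas, which would be the right starting
point for any slicing scheme. A hedged sketch:
{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.TokenRange;
import java.util.Set;

class TokenRangeSketch {
  // Enumerate the cluster's actual token ranges instead of assuming a
  // single contiguous range per node. With vnodes there are typically
  // num_tokens ranges per node (default 256), scattered randomly
  // across the ring.
  static void printRanges(Cluster cluster, String keyspace) {
    Metadata metadata = cluster.getMetadata();
    for (TokenRange range : metadata.getTokenRanges()) {
      Set<Host> replicas = metadata.getReplicas(keyspace, range);
      // unwrap() splits the one range that wraps around the ring's end,
      // so each piece can back a
      // WHERE token(pk) > ? AND token(pk) <= ? query.
      for (TokenRange piece : range.unwrap()) {
        System.out.println(piece + " -> " + replicas);
      }
    }
  }
}
{code}
A Drill sub scan would then have to group many of these small pieces per
endpoint rather than assuming one range per node.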
For unit tests: take a look at https://github.com/doanduyhai/Achilles - it has
some nice support for unit tests, which may make all the manual work to set up
and fill keyspaces/tables superfluous.
Is it a Drill requirement that {{CassandraRecordReader#updateValueVector}} only
mutates using {{String}}s?
General code comments:
* there’s some unused code that can be safely removed
* in {{CassandraRecordReader}}: you can safely replace the
{{clazz.isInstance()}} sequence with {{clazz == ClassName.class}} (see the
sketch below)
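For illustration, the two variants side by side (using {{VarCharVector}} as a
stand-in; the actual sequence in the reader differs):
{code:java}
import org.apache.drill.exec.vector.ValueVector;
import org.apache.drill.exec.vector.VarCharVector;

class InstanceCheckSketch {
  // Before: reflective check, even though the target class is a
  // compile-time constant.
  static boolean viaIsInstance(ValueVector v) {
    return VarCharVector.class.isInstance(v);
  }

  // After: direct class comparison - equivalent here as long as no
  // subclasses of the vector type are expected.
  static boolean viaClassLiteral(ValueVector v) {
    return v.getClass() == VarCharVector.class;
  }
}
{code}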
Note: the patch does not apply onto the current master, only onto master as of
March 31st - there have been some breaking API changes in Drill since then.
For the future: I don’t know whether the current code supports Cassandra’s
user-defined types or collections (maps, sets, lists). If not, that might be a
nice feature for later.
> Cassandra storage engine
> ------------------------
>
> Key: DRILL-92
> URL: https://issues.apache.org/jira/browse/DRILL-92
> Project: Apache Drill
> Issue Type: New Feature
> Reporter: Steven Phillips
> Assignee: Yash Sharma
> Fix For: Future
>
> Attachments: DRILL-92-Cassandra-Storage.patch,
> DRILL-92-CassandraStorage.patch, DRILL-92.patch, DRILL-CASSANDRA.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)