[
https://issues.apache.org/jira/browse/DRILL-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513026#comment-14513026
]
Robert Stupp commented on DRILL-92:
-----------------------------------
[[email protected]] sorry for the late response. Here are my comments:
bq. Tables & Keyspace cache
You don’t need to cache these yourself - the Java driver already caches its
metadata, so a second cache is duplicated effort. Removing it also makes your
code less complex, e.g. all the cache-related {{catch ExecutionException}}
handling goes away.
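For illustration, a minimal sketch of reading the driver’s own metadata
directly instead (the keyspace/table names are placeholders):
{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.KeyspaceMetadata;
import com.datastax.driver.core.TableMetadata;

class MetadataLookupSketch {
  // The Java driver keeps cluster metadata up to date on its own,
  // so there is no need for an additional cache layer on top of it.
  static TableMetadata lookupTable(Cluster cluster, String keyspace, String table) {
    KeyspaceMetadata ks = cluster.getMetadata().getKeyspace(keyspace);
    return ks == null ? null : ks.getTable(table);
  }
}
{code}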
I’m not sure whether the connection cache in {{CassandraConnectionManager}}
really works. For example, if hosts 127.0.0.1 and 127.0.0.2 share the same
{{Cluster}} instance and the cache decides to evict the entry for 127.0.0.1,
the instance that 127.0.0.2 still references no longer works.
Besides that, you query the cache with a {{List<String>}} key but add entries
with a {{String}} key, so lookups can never match what was stored.
As a proposal: use the cluster name as the cache key, both when adding and
when querying.
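A minimal sketch of that proposal, assuming a Guava {{LoadingCache}} - all
class, field and helper names here are hypothetical, not the actual patch:
{code:java}
import com.datastax.driver.core.Cluster;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.RemovalNotification;

class ConnectionCacheSketch {
  // One entry per logical cluster, keyed by a single canonical String
  // (e.g. the cluster name), so lookups and inserts use the same key type.
  private final LoadingCache<String, Cluster> clusters = CacheBuilder.newBuilder()
      .removalListener(new RemovalListener<String, Cluster>() {
        @Override
        public void onRemoval(RemovalNotification<String, Cluster> n) {
          // Safe only because no two keys share a Cluster instance here;
          // whether a long-running operation still uses it remains a concern.
          n.getValue().close();
        }
      })
      .build(new CacheLoader<String, Cluster>() {
        @Override
        public Cluster load(String clusterName) {
          // Hypothetical: resolve contact points for the named cluster
          // from the storage plugin configuration.
          return Cluster.builder()
              .addContactPoints(contactPointsFor(clusterName))
              .build();
        }
      });

  private String[] contactPointsFor(String clusterName) {
    return new String[] {"127.0.0.1", "127.0.0.2"}; // placeholder
  }
}
{code}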
I’m also not sure whether you can always close a {{Cluster}} instance at
eviction time - specifically, whether such an instance might still be in use
by a long-running operation.
Both {{CassandraConnectionManager}} and {{CassandraSchemaFactory}} create
their own {{Cluster}} instances and therefore hold independent resources
(connections, threads, etc.). These should be merged (see the sketch below).
If the two classes are used in completely different contexts, please ignore
this comment.
Also, the {{Cluster}} instance in {{CassandraSchemaFactory}} is never closed,
so its resources leak.
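If they can share one instance, a minimal wiring sketch - the constructor
signatures are assumptions, not what the patch currently has:
{code:java}
import com.datastax.driver.core.Cluster;

// Hypothetical wiring: the plugin owns the single Cluster instance and
// its lifecycle; both collaborators just borrow it.
class CassandraPluginWiringSketch implements AutoCloseable {
  private final Cluster cluster;

  CassandraPluginWiringSketch(String... contactPoints) {
    this.cluster = Cluster.builder().addContactPoints(contactPoints).build();
    // new CassandraConnectionManager(cluster);  // assumed constructor
    // new CassandraSchemaFactory(cluster);      // assumed constructor
  }

  @Override
  public void close() {
    cluster.close(); // releases connections and threads exactly once
  }
}
{code}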
bq. Endpoint affinity & Partition token
If you’re using that code just to assign ranges to Drill hosts, it should be
fine.
But do not assume anything about the tokens assigned to a C* host. Token
assignment depends heavily on the individual cluster configuration
(partitioner, topology, node placement (DC, rack)) and on the keyspace
configuration. It’s not that easy, but it is manageable.
In {{org.apache.drill.exec.store.cassandra.CassandraRecordReader#setup}} you’re
using {{QueryBuilder.token}} for paging/slicing. Unfortunately, that will not
work. Assume the C* cluster uses vnodes (the default is 256 vnodes per node):
vnode tokens are assigned randomly to endpoints (= nodes), not like the
old-fashioned single token per node, so you simply cannot slice the ring with
the {{token()}} function that way. Beyond that, it is quite difficult to split
slices so that they match both C* nodes/vnodes *and* Drill _sub scan ranges_
(is this the correct wording?). Clean slicing across all nodes/vnodes is one
of the weak spots in C* - that’s why Hadoop-on-C* setups recommend disabling
vnodes; they have the same problem. Let me think a bit about that - maybe I
can provide a solution or at least a workaround.
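For reference: newer Java driver versions (2.1.5+, as far as I know) expose
the actual token ranges and their replicas, which would be the right starting
point for any slicing scheme. A hedged sketch:
{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.TokenRange;
import java.util.Set;

class TokenRangeSketch {
  // Enumerate the cluster's actual token ranges instead of assuming a
  // single contiguous range per node. With vnodes there are typically
  // num_tokens ranges per node (default 256), scattered randomly
  // across the ring.
  static void printRanges(Cluster cluster, String keyspace) {
    Metadata metadata = cluster.getMetadata();
    for (TokenRange range : metadata.getTokenRanges()) {
      Set<Host> replicas = metadata.getReplicas(keyspace, range);
      // unwrap() splits the one range that wraps around the ring's end,
      // so each piece can back a
      // WHERE token(pk) > ? AND token(pk) <= ? query.
      for (TokenRange piece : range.unwrap()) {
        System.out.println(piece + " -> " + replicas);
      }
    }
  }
}
{code}
A Drill sub scan would then have to group many of these small pieces per
endpoint rather than assuming one range per node.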
For unit tests: take a look at https://github.com/doanduyhai/Achilles - it has
some nice support for unit tests, which may make all the manual work to set up
and fill keyspaces/tables superfluous.
Is it a Drill requirement that {{CassandraRecordReader#updateValueVector}} only
mutates using {{String}}s?
General code comments:
* there’s some unused code that can be safely removed
* in {{CassandraRecordReader}}: you can safely replace the
{{clazz.isInstance()}} sequence with {{clazz == ClassName.class}} (see the
sketch below)
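For illustration, the two variants side by side (using {{VarCharVector}} as a
stand-in; the actual sequence in the reader differs):
{code:java}
import org.apache.drill.exec.vector.ValueVector;
import org.apache.drill.exec.vector.VarCharVector;

class InstanceCheckSketch {
  // Before: reflective check, even though the target class is a
  // compile-time constant.
  static boolean viaIsInstance(ValueVector v) {
    return VarCharVector.class.isInstance(v);
  }

  // After: direct class comparison - equivalent here as long as no
  // subclasses of the vector type are expected.
  static boolean viaClassLiteral(ValueVector v) {
    return v.getClass() == VarCharVector.class;
  }
}
{code}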
Note: the patch does not apply onto the current master, only onto master as of
March 31st - there have been some breaking API changes in Drill since then.
For the future: I don’t know whether the current code supports Cassandra’s
user-defined types or collections (maps, sets, lists). If not, that might be a
nice feature for later.
> Cassandra storage engine
> ------------------------
>
> Key: DRILL-92
> URL: https://issues.apache.org/jira/browse/DRILL-92
> Project: Apache Drill
> Issue Type: New Feature
> Reporter: Steven Phillips
> Assignee: Yash Sharma
> Fix For: Future
>
> Attachments: DRILL-92-Cassandra-Storage.patch,
> DRILL-92-CassandraStorage.patch, DRILL-92.patch, DRILL-CASSANDRA.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)