[ https://issues.apache.org/jira/browse/CASSANDRA-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249655#comment-14249655 ]

Piotr Kołaczkowski commented on CASSANDRA-7296:
-----------------------------------------------

Honestly, I don't like this idea, for the following reasons:

# It seems to add quite a lot of complexity to handle the following cases:
  ** What do we do if RF > 1 to avoid duplicates? 
  ** If we decide on the primary token range only, what do we do if one of the 
nodes fails and some primary token ranges have no node to query from?
  ** What if the amount of data is large enough that we'd actually like to 
split the token ranges so they are smaller and there are more Spark tasks 
(see the splitting sketch after this list)? This is important for bigger jobs, 
to protect against sudden failures and to avoid recomputing too much when a 
Spark partition is lost.
  ** How do we fetch data from the same node in parallel? Currently it is 
perfectly fine for one Spark node to use multiple cores (mappers) that each 
fetch data from the same coordinator node separately.
# It is trying to solve a theoretical problem that hasn't shown up in practice 
yet.
  ** Russell Spitzer benchmarked vnodes on small/medium/large data sets. There 
was no significant difference on the larger data sets, and only a tiny 
difference on really small sets (where the constant per-query cost is higher 
than the cost of fetching the data).
  ** There are no customers reporting vnodes to be a problem for them.
  ** Theoretical reason: if the data is large enough not to fit in the page 
cache (hundreds of GBs on a single node), 256 additional random seeks are not 
going to cause a huge penalty because:
  *** some of them can be hidden by splitting those queries between separate 
Spark threads, so they would be submitted and executed in parallel
  *** each token range will be *hundreds* of MBs in size, which is large enough 
to hide one or two seeks
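
To illustrate the splitting point above, here is a minimal sketch (Scala) of 
cutting one non-wrapping Murmur3 token range into smaller sub-ranges, each of 
which could back its own Spark partition. This is not the connector's actual 
code; the {{TokenRange}} class, the split count and the example tokens are all 
made up for the sketch:

{code:scala}
object TokenRangeSplitter {
  // Hypothetical model of a non-wrapping Murmur3 token range (start < end).
  case class TokenRange(start: Long, end: Long)

  // Cut one range into roughly `desiredSplits` sub-ranges, so each sub-range
  // can become its own Spark partition (smaller tasks, cheaper recomputation).
  def split(range: TokenRange, desiredSplits: Int): Seq[TokenRange] = {
    // BigInt avoids overflow when the range spans most of the 64-bit token space
    val width  = BigInt(range.end) - BigInt(range.start)
    val step   = (width / desiredSplits).max(1)
    val bounds = (BigInt(range.start) to BigInt(range.end) by step).map(_.toLong)
    val edges  = if (bounds.last == range.end) bounds else bounds :+ range.end
    edges.sliding(2).collect { case Seq(s, e) => TokenRange(s, e) }.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Example: one primary range becomes ~8 smaller Spark partitions
    split(TokenRange(-4611686018427387904L, 0L), 8).foreach(println)
  }
}
{code}

Losing a Spark partition then only costs one sub-range's worth of work to 
recompute, which is exactly why splitting matters for bigger jobs.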

Some *real* performance problems we (and users) have observed:
 * Cassandra uses plenty of CPU when doing sequential scans. It is not 
possible to saturate the bandwidth of even a single laptop spinning HDD, 
because all cores of an i7 CPU @ 2.4 GHz are 100% busy processing those small 
CQL cells, merging rows from different SSTables, ordering cells, filtering out 
tombstones, serializing, etc. The problem doesn't go away after doing a full 
compaction or disabling vnodes. This is a serious problem, because running 
exactly the same query on a plain text file stored in CFS (still C*, but with 
data stored as 2 MB blobs) gives a 3-30x performance boost (depending on who 
ran the benchmark). We need to close this gap. See: 
https://datastax.jira.com/browse/DSP-3670
 * We need to improve the backpressure mechanism, at least so that the driver 
or Spark connector knows to start throttling writes when the cluster can't 
keep up (a rough client-side sketch follows below). Currently Cassandra just 
times out the writes, but once that happens the driver has no clue how long to 
wait before it is ok to resubmit the update. It would actually be good to know 
early enough before timing out, so we could slow down and avoid wasteful 
retrying altogether. Currently it is not possible to predict cluster load by 
e.g. observing write latency, because the latency is extremely good until it 
is suddenly terrible (timeout). This is also important for other, non-Spark 
use cases. See https://issues.apache.org/jira/browse/CASSANDRA-7937.
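
To make the throttling idea concrete, here is a minimal client-side sketch 
(Scala). {{writeAsync}} is a hypothetical stand-in for whatever async write 
the driver exposes, and the in-flight cap and backoff limits are made-up 
numbers; without a server-side signal the client can still only react *after* 
the first timeout, which is the gap CASSANDRA-7937 is about:

{code:scala}
import java.util.concurrent.Semaphore
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// Hypothetical sketch: bound the number of in-flight writes and back off
// exponentially once writes start failing (e.g. timing out), instead of
// resubmitting blindly.
class ThrottledWriter(writeAsync: String => Future[Unit], maxInFlight: Int = 256) {
  private val inFlight = new Semaphore(maxInFlight)
  @volatile private var backoffMs = 0L

  def write(cql: String): Future[Unit] = {
    inFlight.acquire()                          // cap concurrent writes
    if (backoffMs > 0) Thread.sleep(backoffMs)  // crude slowdown once timeouts appear
    writeAsync(cql).andThen {
      case Success(_) =>
        backoffMs = backoffMs / 2               // cluster keeps up: speed back up
        inFlight.release()
      case Failure(_) =>                        // timeout/overload: back off harder
        backoffMs = math.min(1000L, math.max(10L, backoffMs * 2))
        inFlight.release()
    }
  }
}
{code}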




> Add CL.COORDINATOR_ONLY
> -----------------------
>
>                 Key: CASSANDRA-7296
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7296
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tupshin Harper
>
> For reasons such as CASSANDRA-6340 and similar, it would be nice to have a 
> read that never gets distributed, and only works if the coordinator you are 
> talking to is an owner of the row.


