Does C* (2.1) know to read just one SSTable given a skinny row w/frequently updated column(s) & STCS?

Richard Klancer Thu, 16 Feb 2017 08:32:26 -0800

Hi all,

My team is trying to determine whether to use size-tiered or leveled
compaction for some tables in an app that will be moving to production
soon. We are using C* 2.1.


We have a few tables that look like this:

CREATE TABLE ks.counters (
    id timeuuid PRIMARY KEY,
    count counter
)

The count for a given id is updated at rates of, say, 1/s to 100/s.

My question is, if we use size-tiered compaction (and ignoring memtables
and counter cache in this particular case) how many SSTables should I
expect C* to read when I SELECT from this table? (Furthermore I don't mean
to make the question specific to counters -- suppose the column is a bigint
instead)

Naively, the updates leave a trail of outdated cells in multiple SSTables
until they are compacted, and those will have to be inspected in some
fashion to determine which cell has the most recent value.

>>Question<<: Does the read path:
 1) sort the SSTables by timestamp in some fashion when reading, so that it
 2) can ignore all other SSTables once it's found a more recent cell, given
STCS?

I see some evidence suggesting this is true. For example, if (1) holds, an
old wiki page suggests (2) is done by comparing the timestamp of the cell
read from the first SSTable to the max timestamp of the second SSTable.
Presumably flushes and compactions work such that the timestamp of the
recent update would always be greater than the max timestamp of other
SSTables that have that column.
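
To make (1) and (2) concrete, here is a toy Python sketch of the behavior I
have in mind. It is not taken from the C* source: `SSTable`, `read_latest`,
and the dict-based cell layout are hypothetical names for illustration only.
It assumes a regular (non-counter) column and that every SSTable exposes a
max timestamp.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    """Toy stand-in for an SSTable (hypothetical, not a C* class)."""
    max_timestamp: int
    # partition key -> (write timestamp, value)
    cells: dict = field(default_factory=dict)


def read_latest(sstables, key):
    """Return (value, number of SSTables inspected) for `key`."""
    best = None      # newest (timestamp, value) seen so far
    inspected = 0
    # (1) visit the SSTables with the newest max timestamp first
    ordered = sorted(sstables, key=lambda s: s.max_timestamp, reverse=True)
    for sst in ordered:
        # (2) stop once no remaining SSTable can hold a newer cell
        if best is not None and best[0] > sst.max_timestamp:
            break
        inspected += 1
        cell = sst.cells.get(key)
        if cell is not None and (best is None or cell[0] > best[0]):
            best = cell
    value = best[1] if best is not None else None
    return value, inspected
```

In this model, a key whose newest cell lives in the newest SSTable costs one
SSTable read, while a key that was last written long ago costs one read per
SSTable that is at least as new as that write.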

Eventually I'll orient myself well enough in the C* source to answer this
question for myself, but I haven't found clarity in books or web searches.

Now, I'd like to extend the question. Suppose I have

CREATE TABLE ks.more_counters (
    id timeuuid PRIMARY KEY,
    count_1 counter,
    count_2 counter
)

And suppose there are a largish number of partitions with typically 0-10
updates for each counter, spaced relatively far apart in time. Suppose now
C* finds the most recent update of count_1 in the first SSTable it looks
at, but the most recent update of count_2 is in the 4th SSTable back.

>>Question<<: Will C* potentially have to seek to 4 SSTables in this case,
or just 2? Will C* have to actually seek to the SSTables in between the
first (where the most recent update of count_1 lives) and the 4th (where
the most recent update of count_2 lives)? Or is there an additional
optimization (in-memory indices containing per-key column name information)
that tells it "nothing to see here" when it's looking for count_2?
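
Here is the pessimistic alternative as a toy Python sketch, so it is clear
what I am asking about. Again, this is my own model and not C* code:
`SSTable`, `count_reads`, and the row layout are hypothetical, bloom filters
are ignored, and the model assumes there is NO per-key column-name index, so
an SSTable has to be read to learn whether it holds a given column.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    """Toy stand-in for an SSTable (hypothetical, not a C* class)."""
    max_timestamp: int
    # partition key -> {column name: (write timestamp, value)}
    rows: dict = field(default_factory=dict)


def count_reads(sstables, key, columns):
    """Return (latest cells per column, number of SSTables read)."""
    latest = {}
    reads = 0
    ordered = sorted(sstables, key=lambda s: s.max_timestamp, reverse=True)
    for sst in ordered:
        # Stop only when every requested column already has a cell newer
        # than anything this (or any older) SSTable could contain.
        if all(c in latest and latest[c][0] > sst.max_timestamp
               for c in columns):
            break
        reads += 1  # assumption: no index says "no count_2 here"
        for col, cell in sst.rows.get(key, {}).items():
            if col in columns and (col not in latest
                                   or cell[0] > latest[col][0]):
                latest[col] = cell
    return latest, reads


# The scenario above: count_1 is newest in the first SSTable,
# count_2 was last written in the 4th SSTable back.
sstables = [
    SSTable(40, {"id1": {"count_1": (40, 7)}}),
    SSTable(30, {"id1": {"count_1": (30, 6)}}),
    SSTable(20, {"id1": {"count_1": (20, 5)}}),
    SSTable(10, {"id1": {"count_2": (10, 3)}}),
]
cells, reads = count_reads(sstables, "id1", ["count_1", "count_2"])
```

Under these assumptions `reads` is 4 when both columns are requested and 1
when only count_1 is requested. What I would like to know is whether the real
read path does better than this for count_2.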

I'm looking in general to understand the read path better, but feel
free to mention it if you think I'm putting too much emphasis on this
very theoretical count of SSTable seeks for making the
leveled/size-tiered decision.

Thanks, and I look forward to any answers!

--Richard
