Greetings,

This heuristic helps us eliminate unnecessary I/O in certain workloads and 
datasets, by often many orders of magnitude. This is description of the 
problems we faced and how we dealt with it — I am pretty certain this can be 
easily implemented on C*, albeit will likely require a new SSTable format that 
can support the semantics described below.

# Example
One of our services, a price comparison service, has many millions of products 
in our datastore, and we access over 100+ rows on a single page request (almost 
all of them in 2-3 MultiGets - those are executed in parallel in our datastore 
implementation). This is fine, and it rarely takes more than 100ms to get back 
responses from all those requests.

Because we, in practice, need to update all key=>value rows multiple times a 
day (merchants tend to update their price every few hours for some odd reason), 
it means that a key’s columns exist in multiple(almost always all) SSTables of 
a CF, and so, we almost always have to merge the final value for each key from 
all those many SSTables, as opposed to only need to access a single SSTable to 
do that.

In fact, for most CFs of this service, we need to merge most of their SSTables 
to get the final CF, because of that same reason (rows update very frequently, 
as opposed to say, a ‘users’ CF where you typically only set the row once on 
creation and very infrequently after that ).
Surely, there  must have been ways to exploit this pattern and access and 
update semantics. (there are).

Our SSTables are partitioned into chunks. One of those chunks is the index 
chunk which holds distinctKeyId:u64 => offset:u32, sorted by distinctKeyId. 
We have a map which allows us to map  distinctKeyId:u64=> data chunk 
region(skip list), so that this offset is an offset into the respective data 
chunk region — this is so that we won’t have to store 64bit offsets there, and 
that saves us 4bytes / row (every ~4GB we track another data chunk region so in 
practice this is a constant operation ).

# How we are mitigating IO access and merge costs 
Anyway, when enabled, with the additional cost of 64bits for each entry in the 
index chunk, instead of keyId:u64=>(offset:u32), we now use 
keyId:u64=>(offset:u32, latestTs:u32, signature:u32). 

For each CF we are about to serialise to an SSTable, we identify the latest 
timestamp of all columns(milliseconds, we need the unix timestamp). Depending 
on the configuration we also do either of:
1. A digest of all column names.  Currently, we use CRC32. When we build the 
SSTable index chunk, we store distinctKeyId:u64 => (dataChunkSegmentOffset:u32, 
latestTimestamp:u32, digest:u32)
2. Compute a mask based on the first 31 distinct column names encountered so 
far. Here is some pseudocode:

Dictionary sstableDistinctColumnNames;
uint32_t mask = 0;

for (const auto &it : cf->columns)
{
        const auto name = it.name;
        
        if  (sstableDistinctColumnNames.IsSet(name))
                mask|=(1<<sstableDistinctColumnNames[name]);
        else if (sstableDistinctColumnNames.Size() == 31)
        {
                // Cannot track this column, so we won’t be able to do much 
about this row
                mask = 0;
        }
        else
        {
                mask|=(1<<sstableDistinctColumnsNames.Size());
                sstableDistinctColumnsNames.Set(name, 
sstableDistinctColumnNames.Size());
}

and so, we again store distinctKeyId:u64 => (dataChunkSegmentOffset:u32, 
latestTimestamp:u32, map:u32).
We also store sstableDistinctColumnsNames in the SSTable header (each SSTable 
has a header chunk where we store KV records).

Each method comes with pros and cons. Though they probably make sense and you 
get where this is going already, will list them later.

# GET response
So for every CF SStable, we do something like this(C++ pseudocode):
struct candidate
{
        SSTable *t;
        uint64_t offset;
        time32_t ts;
        uint32_t v;
};

Vector<candidate> candidates;

for (const auto &table : cf->sstables)
{
        time32_t latestTs;
        uint32_t v;
        const auto actualOffset = table->Offset(key, latestTs, v);

        if (!actualOffset)
                continue;
        
        candidates.Append({table, actualOffset, latestTs, v});  
}

if (v.IsEmpty())
{
        // Nothing here
        return;
}

v.Sort([](const candidate &c1, const candidate &c2)
{
return TrivialCmp(c1.offset, c2.offset);
});


Depending on what we decide to store (mask or digest of column names):
1. 
Set<uint32_t> seen;
for (const auto &it : v)
{
        if (seen.IsSet(it.v)))
        {
                // We have seen an update for those exact columns already
                continue;
        }
        
        seen.Add(it.v);
        // Unserialize CF, etc, merge
}

That’s all there is to it — the core idea is that we can safely disregard 
SSTable rows if the those exact columns in the CF have been found in an later 
CF found earlier. 

2.
uint32_t seen = 0;

// There is some logic not outlined here, where we need to map from one 
SSTable’s column names to another based on the stored index, (again, 
pseudocode).

for (const auto &it : v)
{

        if (it.v && ((seen&v.v) == v))
        {
                // We don’t care about no columns in this row
                continue;
        }
        seen|=v;
        // Unserialize CF, etc, merge
}


Pros and cons
1.
PROS/CONS: Easier to compute, no restriction to first 31 distinct columns

2. 
PROS/CONS: Can support interleaved columns (as opposed to many all exact 
columns req. of 1)

This is configurable on a per CF basis (We usually choose 2).

Maybe you could consider such a heuristic for C*, it should probably benefit 
your users too.
Apologies if any of this doesn’t make sense in anyway, feel free to ignore:)


Mark Papadakis
@markpapadakis

Reply via email to