Greetings,
This heuristic helps us eliminate unnecessary I/O in certain workloads and
datasets, often by many orders of magnitude. This is a description of the
problem we faced and how we dealt with it. I am pretty certain this can be
easily implemented in C*, although it will likely require a new SSTable format
that can support the semantics described below.
# Example
One of our services, a price comparison service, has many millions of products
in our datastore, and we access 100+ rows on a single page request (almost
all of them in 2-3 MultiGets, which are executed in parallel in our datastore
implementation). This is fine, and it rarely takes more than 100ms to get back
responses from all those requests.
Because, in practice, we need to update all key=>value rows multiple times a
day (merchants tend to update their prices every few hours for some odd
reason), a key’s columns end up existing in multiple (almost always all)
SSTables of a CF, and so we almost always have to merge the final value for
each key from all those SSTables, as opposed to only needing to access a
single SSTable to do that.
In fact, for most CFs of this service, we need to merge most of their SSTables
to get the final CF, for that same reason (rows are updated very frequently,
as opposed to, say, a ‘users’ CF where you typically only set the row once on
creation and very infrequently after that).
Surely, there must be ways to exploit this access and update pattern
(there are).
Our SSTables are partitioned into chunks. One of those chunks is the index
chunk, which holds distinctKeyId:u64 => offset:u32, sorted by distinctKeyId.
We have a map (a skip list) from distinctKeyId:u64 => data chunk region, so
that this offset is an offset into the respective data chunk region; this way
we don’t have to store 64-bit offsets there, which saves us 4 bytes / row
(every ~4GB we track another data chunk region, so in practice this lookup is
a constant-cost operation).
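To make that concrete, here is a minimal sketch of how such a region map could
resolve a (keyId, 32-bit offset) pair into an absolute data chunk offset. The
names and types are mine, made up for illustration, and a std::map stands in
for the skip list mentioned above:

#include <cstdint>
#include <map>

// Each data chunk region covers up to ~4GB of the data chunk, so a 32-bit
// offset relative to the region base is enough for every row in it.
struct DataChunkRegion
{
    uint64_t firstKeyId; // first distinctKeyId stored in this region
    uint64_t baseOffset; // absolute offset where this region begins
};

// std::map stands in for the skip list; the number of regions is tiny
// (one per ~4GB), so lookups are effectively constant-cost.
using RegionMap = std::map<uint64_t, DataChunkRegion>; // keyed by firstKeyId

inline uint64_t absolute_offset(const RegionMap &regions, const uint64_t keyId, const uint32_t offset)
{
    // Locate the last region whose firstKeyId <= keyId; assumes keyId is
    // covered by some region (i.e. regions is not empty)
    auto it = regions.upper_bound(keyId);

    --it;
    return it->second.baseOffset + offset;
}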
# How we mitigate I/O and merge costs
Anyway, when enabled, at the additional cost of 64 bits for each entry in the
index chunk, instead of keyId:u64 => (offset:u32) we now use
keyId:u64 => (offset:u32, latestTs:u32, signature:u32).
For each CF we are about to serialise to an SSTable, we identify the latest
timestamp of all columns (timestamps are in milliseconds; we need the Unix
timestamp). Depending on the configuration, we also do one of the following:
1. Compute a digest of all column names. Currently, we use CRC32. When we
build the SSTable index chunk, we store distinctKeyId:u64 =>
(dataChunkSegmentOffset:u32, latestTimestamp:u32, digest:u32). (See the
sketch after this list.)
2. Compute a mask based on the first 31 distinct column names encountered so
far. Here is some pseudocode:
Dictionary sstableDistinctColumnNames;
uint32_t mask = 0;

for (const auto &it : cf->columns)
{
    const auto name = it.name;

    if (sstableDistinctColumnNames.IsSet(name))
        mask |= (1u << sstableDistinctColumnNames[name]);
    else if (sstableDistinctColumnNames.Size() == 31)
    {
        // Cannot track this column, so we won’t be able to do much
        // about this row; clear the mask and give up on it
        mask = 0;
        break;
    }
    else
    {
        mask |= (1u << sstableDistinctColumnNames.Size());
        sstableDistinctColumnNames.Set(name, sstableDistinctColumnNames.Size());
    }
}
and so, we again store distinctKeyId:u64 => (dataChunkSegmentOffset:u32,
latestTimestamp:u32, mask:u32).
We also store sstableDistinctColumnNames in the SSTable header (each SSTable
has a header chunk where we store KV records).
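For illustration, here is a minimal sketch of how the digest of option 1 could
be computed. The use of zlib’s crc32() and the assumption that column names
are visited in their sorted on-disk order are mine:

#include <cstdint>
#include <string>
#include <vector>
#include <zlib.h> // crc32()

// Digest of all column names of the row being serialised. Names are assumed
// to be visited in their sorted (on-disk) order, so that the same set of
// names always produces the same digest.
static uint32_t column_names_digest(const std::vector<std::string> &columnNames)
{
    uLong crc = crc32(0L, Z_NULL, 0);

    for (const auto &name : columnNames)
        crc = crc32(crc, reinterpret_cast<const Bytef *>(name.data()), name.size());

    return static_cast<uint32_t>(crc);
}

The resulting digest is what would be stored next to the latest timestamp in
the index chunk entry.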
Each method comes with pros and cons. They are probably apparent already and
you can see where this is going, but I will list them later.
# GET response
So for every CF SSTable, we do something like this (C++ pseudocode):
struct candidate
{
    SSTable *t;
    uint64_t offset;
    time32_t ts;
    uint32_t v;
};

Vector<candidate> candidates;

for (const auto &table : cf->sstables)
{
    time32_t latestTs;
    uint32_t v;
    const auto actualOffset = table->Offset(key, latestTs, v);

    if (!actualOffset)
        continue;

    candidates.Append({table, actualOffset, latestTs, v});
}

if (candidates.IsEmpty())
{
    // Nothing here
    return;
}

// Most recent rows first, so that newer rows shadow older ones
candidates.Sort([](const candidate &c1, const candidate &c2)
{
    return TrivialCmp(c2.ts, c1.ts);
});
Depending on what we decide to store (mask or digest of column names):
1.

Set<uint32_t> seen;

for (const auto &it : candidates)
{
    if (seen.IsSet(it.v))
    {
        // We have seen an update for those exact columns already
        continue;
    }

    seen.Add(it.v);
    // Unserialize CF, etc., merge
}
That’s all there is to it. The core idea is that we can safely disregard an
SSTable row if those exact columns in the CF have already been found in a
later (more recent) row encountered earlier in the scan.
2.

uint32_t seen = 0;

// There is some logic not outlined here, where we need to map from one
// SSTable’s column names to another based on the stored index (again,
// pseudocode); see the sketch after this block.
for (const auto &it : candidates)
{
    if (it.v && ((seen & it.v) == it.v))
    {
        // We don’t care about any column in this row; they are all
        // shadowed by newer data
        continue;
    }

    seen |= it.v;
    // Unserialize CF, etc., merge
}
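Regarding the mapping mentioned in the comment above: since each SSTable
assigns bit indices independently, its masks have to be translated into one
common dictionary before they can be compared. A rough sketch of what that
translation could look like follows (all names here are hypothetical):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Translate a per-SSTable mask into the reader’s own column name => bit index
// dictionary, built lazily while scanning the candidate rows.
// Returns 0 ("no information") if the reader runs out of bits, which simply
// means the row cannot be skipped and must be merged as usual.
static uint32_t remap_mask(const uint32_t mask,
                           const std::vector<std::string> &tableColumnNames,        // bit => name, from the SSTable header
                           std::unordered_map<std::string, uint8_t> &readerColumns) // name => bit, owned by the reader
{
    uint32_t out = 0;

    for (uint8_t bit = 0; bit < 31; ++bit)
    {
        // Mask bits are only ever set for columns tracked in the header
        if (0 == (mask & (1u << bit)))
            continue;

        const auto &name = tableColumnNames[bit];
        auto it = readerColumns.find(name);

        if (it == readerColumns.end())
        {
            if (readerColumns.size() == 31)
                return 0;

            it = readerColumns.emplace(name, static_cast<uint8_t>(readerColumns.size())).first;
        }

        out |= (1u << it->second);
    }

    return out;
}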
# Pros and cons
1. Easier to compute, and no restriction to the first 31 distinct columns;
however, a row can only be skipped when it holds exactly the same set of
columns as a row already seen.
2. Can support interleaved columns, i.e. a row is skipped whenever its columns
are a subset of the columns already seen (as opposed to the exact-columns
requirement of 1), but it is limited to the first 31 distinct column names.
This is configurable on a per-CF basis (we usually choose 2).
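As a concrete example of the difference: say the most recent SSTable row for a
key holds columns {price, title} and an older SSTable row holds only {price}.
With 2, the older row’s mask is a subset of the bits already seen, so it is
skipped without being read; with 1, its digest differs from the newer row’s
digest (different column sets), so it still has to be read and merged.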
Maybe you could consider such a heuristic for C*; it should probably benefit
your users too.
Apologies if any of this doesn’t make sense in any way; feel free to ignore :)
Mark Papadakis
@markpapadakis