Greetings,

This heuristic helps us eliminate unnecessary I/O for certain workloads and datasets, often by many orders of magnitude. This is a description of the problem we faced and how we dealt with it. I am fairly certain it could be implemented in C* without much trouble, although it will likely require a new SSTable format that can support the semantics described below.
# Example

One of our services, a price comparison service, has many millions of products in our datastore, and a single page request accesses over 100 rows (almost all of them via 2-3 MultiGets, which our datastore implementation executes in parallel). This is fine, and it rarely takes more than 100ms to get responses back for all of those requests.

Because, in practice, we need to update all key => value rows multiple times a day (merchants tend to update their prices every few hours for some odd reason), a key's columns end up existing in multiple (almost always all) SSTables of a CF. So we almost always have to merge the final value for each key from all of those SSTables, as opposed to only needing to access a single SSTable. In fact, for most CFs of this service we need to merge most of their SSTables to get the final CF, for the same reason (rows are updated very frequently, as opposed to, say, a 'users' CF, where you typically set the row once on creation and update it very infrequently after that).

Surely there must be ways to exploit this access and update pattern (there are).

Our SSTables are partitioned into chunks. One of those chunks is the index chunk, which holds distinctKeyId:u64 => offset:u32, sorted by distinctKeyId. We also keep a map (a skip list) from distinctKeyId:u64 to a data chunk region, so that the offset is relative to the respective data chunk region. This way we don't have to store 64-bit offsets in the index, which saves us 4 bytes per row (we track another data chunk region every ~4GB, so in practice the lookup is a constant-time operation).

# How we are mitigating I/O access and merge costs

When this is enabled, at the additional cost of 64 bits for each entry in the index chunk, instead of keyId:u64 => (offset:u32) we now store keyId:u64 => (offset:u32, latestTs:u32, signature:u32).

For each CF we are about to serialise to an SSTable, we identify the latest timestamp among all of its columns (milliseconds; we need the Unix timestamp). Depending on the configuration, we then do either of:

1. Compute a digest of all column names (currently we use CRC32; a minimal sketch follows at the end of this section). When we build the SSTable index chunk, we store distinctKeyId:u64 => (dataChunkSegmentOffset:u32, latestTimestamp:u32, digest:u32).
2. Compute a mask based on the first 31 distinct column names encountered so far in the SSTable. Here is some pseudocode:

```cpp
// Per-SSTable dictionary: column name => bit index (also stored in the SSTable header chunk)
Dictionary sstableDistinctColumnNames;
// Per-row mask; 0 means "not tracked", which disables the optimisation for this row
uint32_t mask = 0;

for (const auto &it : cf->columns) {
    const auto name = it.name;

    if (sstableDistinctColumnNames.IsSet(name))
        mask |= (1u << sstableDistinctColumnNames[name]);
    else if (sstableDistinctColumnNames.Size() == 31) {
        // Cannot track this column, so we won't be able to do much about this row
        mask = 0;
        break;
    } else {
        mask |= (1u << sstableDistinctColumnNames.Size());
        sstableDistinctColumnNames.Set(name, sstableDistinctColumnNames.Size());
    }
}
```

In this case we again store distinctKeyId:u64 => (dataChunkSegmentOffset:u32, latestTimestamp:u32, mask:u32), and we also store sstableDistinctColumnNames in the SSTable header (each SSTable has a header chunk where we store KV records).

Each method comes with pros and cons. They probably make sense and you can see where this is going already, but I will list them later.
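To make option 1 concrete, here is a minimal, hypothetical sketch of how such a digest could be computed. It is not our actual code: `ColumnNamesDigest` is a made-up name, it assumes zlib's crc32(), and it assumes column names are visited in their stored (sorted) order so that two rows updating the same set of columns yield the same digest.

```cpp
#include <zlib.h> // crc32()

#include <cstdint>
#include <string>
#include <vector>

// Sketch of variant 1: a CRC32 over the row's column names, computed while
// the CF is serialised to the SSTable. A zero byte is fed between names so
// that ("ab", "c") and ("a", "bc") don't produce the same digest.
static uint32_t ColumnNamesDigest(const std::vector<std::string> &columnNames) {
    static const Bytef sep{0};
    uLong crc = crc32(0L, Z_NULL, 0);

    for (const auto &name : columnNames) {
        crc = crc32(crc, reinterpret_cast<const Bytef *>(name.data()), uInt(name.size()));
        crc = crc32(crc, &sep, 1);
    }

    return uint32_t(crc);
}
```

The resulting 32-bit value is what would be stored as digest:u32 in the index chunk entry for that key.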
# GET response

So, for every SSTable of the CF, we do something like this (C++ pseudocode):

```cpp
struct candidate {
    SSTable *t;
    uint64_t offset;
    time32_t ts;
    uint32_t v; // digest or mask, depending on the configuration
};

Vector<candidate> candidates;

for (const auto &table : cf->sstables) {
    time32_t latestTs;
    uint32_t v;
    const auto actualOffset = table->Offset(key, latestTs, v);

    if (!actualOffset)
        continue;

    candidates.Append({table, actualOffset, latestTs, v});
}

if (candidates.IsEmpty()) {
    // Nothing here
    return;
}

// Process candidates newest-first, so that older rows can be skipped
candidates.Sort([](const candidate &c1, const candidate &c2) {
    return TrivialCmp(c2.ts, c1.ts);
});
```

Depending on what we decided to store (digest or mask of column names):

1. Digest:

```cpp
Set<uint32_t> seen;

for (const auto &it : candidates) {
    if (seen.IsSet(it.v)) {
        // We have already seen a newer update for those exact columns
        continue;
    }

    seen.Add(it.v);
    // Unserialize the CF, merge, etc.
}
```

That's all there is to it: the core idea is that we can safely disregard an SSTable row if a newer row with those exact columns has already been processed.

2. Mask. There is some logic not outlined here, where we need to map column names from one SSTable's header dictionary to another based on the stored index (see the sketch in the P.S. below). Again, pseudocode:

```cpp
uint32_t seen = 0;

for (const auto &it : candidates) {
    if (it.v && ((seen & it.v) == it.v)) {
        // Every column in this row has already been seen in newer rows
        continue;
    }

    seen |= it.v;
    // Unserialize the CF, merge, etc.
}
```

# Pros and cons

1. Digest: easier to compute, and no restriction to the first 31 distinct column names. It only helps when a newer row covers those exact columns, though.
2. Mask: can handle interleaved column sets (a row is skipped if its columns are a subset of what newer rows already provided), as opposed to the exact-columns requirement of 1.

This is configurable on a per-CF basis (we usually choose 2). Maybe you could consider such a heuristic for C*; it should probably benefit your users too.

Apologies if any of this doesn't make sense in any way, feel free to ignore :)

Mark Papadakis
@markpapadakis
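P.S. For completeness, here is a rough, hypothetical sketch of the remapping step mentioned under variant 2 above. `ColumnMaskRemapper`, `sharedBits`, and `Remap` are made-up names rather than our actual code, and it assumes the per-SSTable column-name dictionary has been read from the header chunk in bit order.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Each SSTable's mask bits refer to that table's own header dictionary, so
// before masks from different SSTables can be compared they are translated
// into one shared, per-request bit assignment.
struct ColumnMaskRemapper {
    std::unordered_map<std::string, uint8_t> sharedBits; // column name => shared bit index

    // localNames[i] is the column name the source SSTable assigned to bit i
    // (taken from its header chunk). Returns 0 if the mask cannot be fully
    // represented, which simply disables the optimisation for that row.
    uint32_t Remap(const uint32_t localMask, const std::vector<std::string> &localNames) {
        uint32_t out = 0;

        for (size_t i = 0; i < localNames.size() && i < 31; ++i) {
            if (0 == (localMask & (1u << i)))
                continue;

            auto it = sharedBits.find(localNames[i]);

            if (it == sharedBits.end()) {
                if (sharedBits.size() == 31)
                    return 0; // out of shared bits; fall back to merging this row

                it = sharedBits.emplace(localNames[i], uint8_t(sharedBits.size())).first;
            }

            out |= (1u << it->second);
        }

        return out;
    }
};
```

The `seen` mask in the GET loop would then be built from these remapped values, so that (seen & it.v) == it.v compares like with like across SSTables.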