Hi Mark,

This specific heuristic is unlikely to be applied, as (if I've understood
it correctly) it has a very narrow window of utility, applying only to those
updates that hit *exactly* the same clustering columns (cql rows) *and* data
columns, and it is not trivial to maintain (either cpu- or memory-wise).
However, two variants of this heuristic are already being considered for
inclusion as part of the new sstable format we're introducing in
CASSANDRA-7447 <https://issues.apache.org/jira/browse/CASSANDRA-7447>,
which is an extension of the heuristic that is already applied at the
whole sstable level.

(1) Per partition, we will store the maximum timestamp (whether this
sits in the hash index / key cache or in the clustering column index part
is an open question)
 - this permits us to stop looking at files once we have a complete set of
the data we expect to return, i.e. all selected fields for the complete set
of selected rows

(2) Per clustering row, we may store enough information to construct the
max timestamp for the row
 - this permits us to stop looking at data pages once we know we have all
selected fields for a given row (a minimal sketch of how these cut-offs
could be used on the read path follows below)
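
To make the intended cut-offs concrete, here is a minimal sketch of how
variant (1) could be used on the read path. All the names (SSTableReader,
maxPartitionTimestamp, readPartition) are purely illustrative, not actual
C* classes; variant (2) would apply the same test per clustering row against
a per-row maximum timestamp, skipping individual data pages rather than
whole files.

#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Illustrative only: hypothetical reader types, not the real read path.
struct Cell { std::string name; int64_t timestamp; };

struct SSTableReader
{
    int64_t maxPartitionTimestamp;                  // variant (1): stored per partition
    std::vector<Cell> read(const std::string &key)  // cells for the requested partition
    {
        return {};                                  // stub; a real reader would deserialize the data chunk
    }
};

// 'tables' must be sorted by maxPartitionTimestamp, newest first.
std::map<std::string, Cell> readPartition(const std::vector<SSTableReader *> &tables,
                                          const std::string &key,
                                          const std::set<std::string> &selected)
{
    std::map<std::string, Cell> result;             // column name -> newest cell found so far

    for (SSTableReader *t : tables)
    {
        if (result.size() == selected.size())
        {
            // Do we already hold, for every selected column, a value at least
            // as new as anything this (or any later) sstable could contain?
            bool complete = true;
            for (const auto &kv : result)
                if (kv.second.timestamp < t->maxPartitionTimestamp)
                {
                    complete = false;
                    break;
                }
            if (complete)
                break;                              // stop consulting further sstables for this partition
        }

        for (const Cell &c : t->read(key))
            if (selected.count(c.name) &&
                (!result.count(c.name) || result[c.name].timestamp < c.timestamp))
                result[c.name] = c;
    }
    return result;
}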




On Sun, Sep 7, 2014 at 11:30 PM, Mark Papadakis <markuspapada...@icloud.com>
wrote:

> Greetings,
>
> This heuristic helps us eliminate unnecessary I/O in certain workloads and
> datasets, often by many orders of magnitude. This is a description of the
> problems we faced and how we dealt with them. I am pretty certain this can
> be easily implemented on C*, although it will likely require a new SSTable
> format that can support the semantics described below.
>
> # Example
> One of our services, a price comparison service, has many millions of
> products in our datastore, and we access over 100 rows in a single page
> request (almost all of them in 2-3 MultiGets - those are executed in
> parallel in our datastore implementation). This is fine, and it rarely
> takes more than 100ms to get back responses from all those requests.
>
> Because we, in practice, need to update all key=>value rows multiple times
> a day (merchants tend to update their prices every few hours for some odd
> reason), a key’s columns exist in multiple (almost always all) SSTables of
> a CF, and so we almost always have to merge the final value for each key
> from all those SSTables, as opposed to only needing to access a single
> SSTable to do that.
>
> In fact, for most CFs of this service, we need to merge most of their
> SSTables to get the final CF, for that same reason (rows update very
> frequently, as opposed to, say, a ‘users’ CF where you typically set the
> row once on creation and very infrequently after that).
> Surely, there must be ways to exploit this pattern and these access and
> update semantics. (There are.)
>
> Our SSTables are partitioned into chunks. One of those chunks is the index
> chunk, which holds distinctKeyId:u64 => offset:u32, sorted by distinctKeyId.
> We have a map (a skip list) which lets us map distinctKeyId:u64 => data
> chunk region, so that this offset is an offset into the respective data
> chunk region. This means we don’t have to store 64-bit offsets there, which
> saves us 4 bytes per row (every ~4GB we track another data chunk region, so
> in practice this lookup is a constant-time operation).
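>
> To make this concrete, a minimal sketch (with made-up names, and a binary
> search standing in for the skip list): each data chunk region records the
> absolute file position it starts at, so the index only needs a 32-bit
> offset relative to the region a key falls into.
>
> #include <cstdint>
> #include <vector>
>
> // Sketch of region-relative offsets: every ~4GB of data chunk gets its own
> // region, and the index stores a 32-bit offset relative to that region,
> // saving 4 bytes per row versus a full 64-bit file offset.
> struct DataChunkRegion
> {
>         uint64_t firstKeyId; // smallest distinctKeyId stored in this region
>         uint64_t baseOffset; // absolute file offset where the region begins
> };
>
> // 'regions' is sorted by firstKeyId and assumed non-empty; the skip list of
> // the original description is replaced here by a simple binary search.
> uint64_t AbsoluteOffset(const std::vector<DataChunkRegion> &regions,
>                         uint64_t keyId, uint32_t relativeOffset)
> {
>         size_t lo = 0, hi = regions.size();
>
>         while (hi - lo > 1)
>         {
>                 const size_t mid = lo + (hi - lo) / 2;
>
>                 if (regions[mid].firstKeyId <= keyId)
>                         lo = mid;
>                 else
>                         hi = mid;
>         }
>         return regions[lo].baseOffset + relativeOffset;
> }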
>
> # How we are mitigating IO access and merge costs
> Anyway, when enabled, at the additional cost of 64 bits for each entry in
> the index chunk, instead of keyId:u64 => (offset:u32) we now use
> keyId:u64 => (offset:u32, latestTs:u32, signature:u32).
>
> For each CF we are about to serialise to an SSTable, we identify the
> latest timestamp of all columns (in milliseconds; we need the Unix
> timestamp). Depending on the configuration, we also do either of:
> 1. Compute a digest of all column names. Currently, we use CRC32 (a sketch
> of this follows after the pseudocode for option 2 below). When we build
> the SSTable index chunk, we store distinctKeyId:u64 =>
> (dataChunkSegmentOffset:u32, latestTimestamp:u32, digest:u32)
> 2. Compute a mask based on the first 31 distinct column names encountered
> so far. Here is some pseudocode:
>
> Dictionary sstableDistinctColumnNames;
> uint32_t mask = 0;
>
> for (const auto &it : cf->columns)
> {
>         const auto name = it.name;
>
>         if (sstableDistinctColumnNames.IsSet(name))
>         {
>                 mask |= (1u << sstableDistinctColumnNames[name]);
>         }
>         else if (sstableDistinctColumnNames.Size() == 31)
>         {
>                 // Cannot track this column, so we won’t be able to do
>                 // much about this row
>                 mask = 0;
>                 break;
>         }
>         else
>         {
>                 mask |= (1u << sstableDistinctColumnNames.Size());
>                 sstableDistinctColumnNames.Set(name,
>                         sstableDistinctColumnNames.Size());
>         }
> }
>
> and so, we again store distinctKeyId:u64 => (dataChunkSegmentOffset:u32,
> latestTimestamp:u32, mask:u32).
> We also store sstableDistinctColumnNames in the SSTable header (each
> SSTable has a header chunk where we store KV records).
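>
> For completeness, a minimal sketch of option 1’s digest (assuming zlib’s
> crc32(), and that column names are hashed in a canonical sorted order so
> the same set of names always yields the same signature):
>
> #include <algorithm>
> #include <cstdint>
> #include <string>
> #include <vector>
> #include <zlib.h> // crc32()
>
> // Option 1 sketch: a CRC32 over the row's column names. Sorting first makes
> // the signature independent of the order the columns are encountered in.
> uint32_t ColumnNamesDigest(std::vector<std::string> names)
> {
>         std::sort(names.begin(), names.end());
>
>         uLong crc = crc32(0L, Z_NULL, 0);
>
>         for (const auto &name : names)
>                 crc = crc32(crc, reinterpret_cast<const Bytef *>(name.data()),
>                             static_cast<uInt>(name.size()));
>
>         return static_cast<uint32_t>(crc);
> }
>
> The result is what ends up as digest:u32 in the index entry above.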
>
> Each method comes with pros and cons. Though they are probably apparent
> and you can see where this is going already, I will list them later.
>
> # GET response
> So for every SSTable of the CF, we do something like this (C++ pseudocode):
> struct candidate
> {
>         SSTable *t;
>         uint64_t offset;
>         time32_t ts;
>         uint32_t v;
> };
>
> Vector<candidate> candidates;
>
> for (const auto &table : cf->sstables)
> {
>         time32_t latestTs;
>         uint32_t v;
>         const auto actualOffset = table->Offset(key, latestTs, v);
>
>         if (!actualOffset)
>                 continue;
>
>         candidates.Append({table, actualOffset, latestTs, v});
> }
>
> if (candidates.IsEmpty())
> {
>         // Nothing here
>         return;
> }
>
> // Newest first, so that earlier entries supersede later ones
> candidates.Sort([](const candidate &c1, const candidate &c2)
> {
>         return TrivialCmp(c2.ts, c1.ts);
> });
>
>
> Depending on what we decide to store (mask or digest of column names):
> 1.
> Set<uint32_t> seen;
> for (const auto &it : candidates)
> {
>         if (seen.IsSet(it.v))
>         {
>                 // We have seen an update for those exact columns already
>                 continue;
>         }
>
>         seen.Add(it.v);
>         // Unserialize CF, etc, merge
> }
>
> That’s all there is to it: the core idea is that we can safely disregard
> an SSTable’s row if those exact columns have already been found in a newer
> CF encountered earlier in the iteration.
>
> 2.
> uint32_t seen = 0;
>
> // There is some logic not outlined here, where we need to map from one
> // SSTable’s column names to another based on the stored index (again,
> // pseudocode; a sketch of this mapping follows below).
>
> for (const auto &it : candidates)
> {
>         if (it.v && ((seen & it.v) == it.v))
>         {
>                 // Every column in this row has already been seen in a
>                 // newer row, so we don’t care about any of them
>                 continue;
>         }
>
>         seen |= it.v;
>         // Unserialize CF, etc, merge
> }
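>
> A rough sketch of that mapping step (hypothetical names; each SSTable ships
> its own sstableDistinctColumnNames dictionary in its header, so a per-row
> mask has to be translated into one shared bit ordering before masks from
> different SSTables can be compared and OR-ed):
>
> #include <cstdint>
> #include <map>
> #include <string>
> #include <vector>
>
> // Remap a mask expressed in one SSTable's bit ordering into a query-global
> // ordering shared by all SSTables taking part in this read. Returns 0
> // ("cannot track") if the global dictionary would exceed 31 distinct names.
> uint32_t RemapMask(uint32_t mask,
>                    const std::vector<std::string> &tableColumns, // bit i -> name, per SSTable
>                    std::map<std::string, int> &globalIndex)      // shared across the read
> {
>         uint32_t out = 0;
>
>         for (int bit = 0; bit < (int)tableColumns.size(); ++bit)
>         {
>                 if (!(mask & (1u << bit)))
>                         continue;
>
>                 const auto it = globalIndex.find(tableColumns[bit]);
>                 int idx;
>
>                 if (it != globalIndex.end())
>                         idx = it->second;
>                 else if (globalIndex.size() == 31)
>                         return 0; // too many distinct names to track
>                 else
>                 {
>                         idx = (int)globalIndex.size();
>                         globalIndex[tableColumns[bit]] = idx;
>                 }
>                 out |= (1u << idx);
>         }
>         return out;
> }
>
> In the loop above, it.v would then be remapped this way before the
> (seen & it.v) == it.v test.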
>
>
> # Pros and cons
> 1. Easier to compute, and not restricted to the first 31 distinct column
> names; on the other hand, it can only skip a row when an update hit exactly
> the same set of columns.
>
> 2. Can support interleaved columns (as opposed to the exact-same-columns
> requirement of 1), but is limited to tracking the first 31 distinct column
> names.
>
> This is configurable on a per-CF basis (we usually choose 2).
>
> Maybe you could consider such a heuristic for C*; it should probably
> benefit your users too.
> Apologies if any of this doesn’t make sense in any way, feel free to
> ignore :)
>
>
> Mark Papadakis
> @markpapadakis
