Re: About the relationship between the sstable compaction and the read path

Jeff Jirsa Wed, 09 Jan 2019 06:44:09 -0800

You’re comparing single machine key/value stores to a distributed db with a 
much richer data model (partitions/slices, statics, range reads, range 
deletions, etc). They’re going to read very differently. Instead of explaining 
why they’re not like rocks/ldb, how about you tell us what you’re trying to do 
/ learn so we can answer the real question?


A few other notes inline.

-- 
Jeff Jirsa


> On Jan 8, 2019, at 10:51 PM, Jinhua Luo <luajit...@gmail.com> wrote:
> 
> Thanks. Let me clarify my questions more.
> 
> 1) For memtable, if the selected columns (assuming they are of simple
> types) can be found in the memtable alone, why bother searching the
> sstables at all? leveldb and rocksdb stop consulting sstables once
> the memtable already fulfills the query.

We stop at the memtable if we know that’s all we need. This depends on a lot of 
factors (schema, point read vs slice, etc)

> 
> 2) For STCS and LCS, obviously, the sstables are grouped in
> generations (old mutations are promoted into the next level or bucket),
> so why not search the columns level by level (or bucket by bucket)
> until all selected columns are collected? leveldb and rocksdb do it
> this way.

They’re single machine and Cassandra isn’t. There’s no guarantee in Cassandra 
that the small sstables in STCS or the low levels in LCS hold the newest data:

- you can write arbitrary timestamps into the memtable
- read repair can put old data in the memtable
- streaming (bootstrap/repair) can put old data into new files
- user processes (nodetool refresh) can put old data into new files
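
To illustrate why the age of a file says nothing about the age of the data in
it, here is a minimal Python sketch. It is not Cassandra code: the `Cell` type
and the `reconcile` helper are made up for illustration, and it assumes plain
last-write-wins on the write timestamp (no tombstones, no tie-breaking).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # write timestamp; clients may set it arbitrarily


def reconcile(cells: Iterable[Optional[Cell]]) -> Optional[Cell]:
    """Return the cell with the highest write timestamp, ignoring misses."""
    winner = None
    for cell in cells:
        if cell is not None and (winner is None or cell.timestamp > winner.timestamp):
            winner = cell
    return winner


# Sources listed from "newest container" to "oldest container".
memtable = Cell("from-read-repair", timestamp=100)       # old data repaired into the memtable
new_small_sstable = Cell("streamed-in", timestamp=200)   # old data streamed into a new file
old_large_sstable = Cell("latest-write", timestamp=300)  # the actual newest write

result = reconcile([memtable, new_small_sstable, old_large_sstable])
print(result.value)  # latest-write
```

Stopping at the first source that has the column would return the read-repaired
value here, which is why every candidate source has to be merged by timestamp.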


> 
> 3) Could you explain the collection, cdt and counter types in more
> detail? Do they need to iterate over all sstables, given that they
> cannot simply be filtered by timestamp or value range?
> 

I can’t (combination of time available and it’s been a long time since I’ve 
dealt with that code and I don’t want to misspeak).


> For collection, when I select a column of collection type, e.g.
> map<text, text>, to ensure the whole set of map fields is collected,
> it is necessary to search in all sstables.
> 
> For cdt, it needs to ensure all fields of the cdt are collected.
> 
> For counter, it needs to merge all mutations distributed in all
> sstables to give a final state of counter value.
> 
> Am I correct? If so, then these three complex types seem less
> efficient than simple types, right?
> 
> On Tue, Jan 8, 2019 at 11:58 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>> 
>> First:
>> 
>> Compaction controls how sstables are combined but not how they’re read. The 
>> read path (with one tiny exception) doesn’t know or care which compaction 
>> strategy you’re using.
>> 
>> A few more notes inline.
>> 
>>> On Jan 8, 2019, at 3:04 AM, Jinhua Luo <luajit...@gmail.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> Compaction organizes the sstables; e.g. with LCS, the sstables are
>>> grouped into levels, and the read path should read sstables level
>>> by level until the read is fulfilled, correct?
>> 
>> LCS levels are there to minimize the number of sstables scanned - at most 
>> one per level above L0, since sstables within such a level don’t overlap - 
>> but there’s no attempt to fulfill the read with low levels beyond the 
>> filtering done by timestamp.
>> 
>>> 
>>> For STCS, would it search sstables bucket by bucket, from smallest to
>>> largest?
>> 
>> Nope. No attempt to do this.
>> 
>>> 
>>> What about the other compaction strategies? Would they iterate over all
>>> sstables?
>> 
>> In all cases, we’ll use a combination of bloom filters and sstable metadata 
>> and indices to include / exclude sstables. If the bloom filter hits, we’ll 
>> consider things like timestamps and whether or not the min/max clustering of 
>> the sstable matches the slice we care about. We don’t consult the compaction 
>> strategy, though the compaction strategy may have (in the case of LCS or 
>> TWCS) placed the sstables into a state that makes this read less expensive.
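
A rough Python sketch of that include/exclude step follows. It is an
illustration only, not the actual read path: `SSTableMeta` and
`candidate_sstables` are hypothetical names, a plain set stands in for the
bloom filter (a real one can report false positives), clustering values are
simplified to integers, and timestamp-based pruning is left out.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class SSTableMeta:
    name: str
    keys: Set[str]        # stand-in for the per-sstable bloom filter
    min_clustering: int   # smallest clustering value stored in the file
    max_clustering: int   # largest clustering value stored in the file

    def might_contain(self, partition_key: str) -> bool:
        return partition_key in self.keys


def candidate_sstables(sstables: List[SSTableMeta], partition_key: str,
                       slice_start: int, slice_end: int) -> List[SSTableMeta]:
    """Keep only sstables that may hold rows for the key inside the slice."""
    selected = []
    for s in sstables:
        if not s.might_contain(partition_key):
            continue  # bloom filter says the partition is not in this file
        if s.max_clustering < slice_start or s.min_clustering > slice_end:
            continue  # clustering range does not intersect the requested slice
        selected.append(s)
    return selected


sstables = [
    SSTableMeta("big-old", {"k1", "k2"}, 0, 50),
    SSTableMeta("small-new", {"k1"}, 60, 90),
    SSTableMeta("other", {"k9"}, 0, 100),
]
picked = candidate_sstables(sstables, "k1", slice_start=10, slice_end=20)
print([s.name for s in picked])  # ['big-old']
```

Note that the function takes no compaction strategy argument: the strategy only
influences how many files survive these checks, not how the checks are made.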
>> 
>>> 
>>> But the code confuses me. In
>>> org.apache.cassandra.db.SinglePartitionReadCommand#queryMemtableAndDiskInternal,
>>> it seems that, whether or not the selected columns have already been
>>> collected (leaving aside the collection/cdt and counter cases; let's
>>> assume the selected columns are simple cells), it searches both the
>>> memtable and all sstables, regardless of the compaction strategy.
>> 
>> There’s another code path that takes timestamps into account and does some 
>> smart-ish exclusion of sstables that aren’t needed for the read command.
>> 
>>> 
>>> Why?
>>> 
>>> Moreover, for collection/cdt (non-frozen) and counter types, it would
>>> need to iterate over all sstables to ensure the whole set of fields is
>>> collected, correct? If so, such multi-cell or counter types are
>>> heavyweight in performance, correct?
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>> 