[ https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-1608:
--------------------------------------

    Attachment: 1608-v4.txt

v4 attached.

Manifest
========

- I noticed that Manifest.generations and lastCompactedKeys could be simplified 
to arrays if we are willing to assume that no node will have more than a PB or 
so of data in a single CF.  Which feels reasonable to me even with capacity 
expanding as fast as it is. :)
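
A minimal sketch of that simplification, assuming a fixed maximum level count
(MAX_LEVELS and initGenerations below are illustrative names; the two field
names are the ones from the patch):
{code}
    // With sstables capped at a fixed size and each level ~10x the previous,
    // nine levels already cover well past a PB per CF, so plain arrays
    // indexed by level are enough.
    private static final int MAX_LEVELS = 9;

    @SuppressWarnings("unchecked")
    private final List<SSTableReader>[] generations = new List[MAX_LEVELS];
    private final DecoratedKey[] lastCompactedKeys = new DecoratedKey[MAX_LEVELS];

    private void initGenerations()
    {
        for (int i = 0; i < MAX_LEVELS; i++)
            generations[i] = new ArrayList<SSTableReader>();
    }
{code}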

- What is the 1.25 supposed to be doing here?
{code}
        // skip newlevel if the resulting sstables exceed newlevel threshold
        if (maxBytesForLevel(newLevel) < SSTableReader.getTotalBytes(added)
            && SSTableReader.getTotalBytes(getLevel(newLevel + 1)) == 0 * 1.25)
{code}

- Why the "all on the same level" special case?  Is this just saying "L0 
compactions must go into L1?"
{noformat}
        // the level for the added sstables is the max of the removed ones,
        // plus one if the removed were all on the same level
{noformat}
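
For clarity, the rule as I read that comment (a sketch only; levelOf() is an
assumed manifest lookup, not the patch's code):
{code}
    // New sstables land on the highest level among the removed ones, bumped
    // by one when every removed sstable came from that same level (so an
    // L0-only compaction promotes into L1).
    private int levelForAdded(Iterable<SSTableReader> removed)
    {
        int minLevel = Integer.MAX_VALUE;
        int maxLevel = 0;
        for (SSTableReader sstable : removed)
        {
            int level = levelOf(sstable); // assumed: the sstable's current level in the manifest
            minLevel = Math.min(minLevel, level);
            maxLevel = Math.max(maxLevel, level);
        }
        return minLevel == maxLevel ? maxLevel + 1 : maxLevel;
    }
{code}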

- Removed this.  If L0 is large, it doesn't necessarily follow that L1 is large
too.  I don't see a good reason to second-guess the scoring here.
{code}
            if (candidates.size() > 32 && bestLevel == 0)
            {
                candidates = getCandidatesFor(1);
            }
{code}

- Redid L0 candidate selection to follow the LevelDB algorithm (pick one L0
sstable, then add the other L0s and the L1s that overlap it).  This means that
if we're doing sequential writes we don't do "extra" work compacting
non-overlapping L0s unnecessarily.  (A niche use, to be sure, given our
emphasis on RP, but it's not a lot of code.)
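
In outline, that selection looks like the following sketch (getLevel(),
overlaps(), rangeOf(), merge(), and the Range type here are placeholders for
the actual range handling, not the patch's helpers):
{code}
    // Pick a seed L0 sstable, pull in any other L0 sstables whose key range
    // overlaps it (repeating until the set stops growing, since each addition
    // can widen the range), then add the L1 sstables the combined range touches.
    private Collection<SSTableReader> l0CandidatesFor(SSTableReader seed)
    {
        Set<SSTableReader> candidates = new HashSet<SSTableReader>();
        candidates.add(seed);
        Range range = rangeOf(seed);

        boolean grew = true;
        while (grew)
        {
            grew = false;
            for (SSTableReader sstable : getLevel(0))
            {
                if (!candidates.contains(sstable) && overlaps(range, rangeOf(sstable)))
                {
                    candidates.add(sstable);
                    range = merge(range, rangeOf(sstable)); // widen to cover the new sstable
                    grew = true;
                }
            }
        }

        for (SSTableReader sstable : getLevel(1))
            if (overlaps(range, rangeOf(sstable)))
                candidates.add(sstable);

        return candidates;
    }
{code}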

- L0 only gets two sstables before it's overcapacity?  Are we still allowing
L0 sstables to be large?  If so, it's not even two.

- "Exposing number of SSTables in L0 as a JMX property probably isn't a bad 
idea."
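
If we do that, it's the standard MBean dance; a hypothetical shape (the
interface name and object name below are illustrative, not anything in the
patch):
{code}
    // Hypothetical MBean exposing the L0 sstable count.
    public interface LeveledManifestMBean
    {
        int getLevel0SSTableCount();
    }

    // registration with the platform MBeanServer, e.g.:
    // ManagementFactory.getPlatformMBeanServer().registerMBean(manifest,
    //     new ObjectName("org.apache.cassandra.db:type=LeveledManifest"));
{code}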

- It's not correct for the create/load code to assume that the first data
directory stays constant across restarts -- it should check all directories
when loading.
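
E.g., something along these lines on load (a sketch, assuming
getAllDataFileLocations() returns the configured data directories; the class
and path handling are illustrative):
{code}
import java.io.File;

import org.apache.cassandra.config.DatabaseDescriptor;

public class ManifestLoader
{
    // Look in every configured data directory for the on-disk manifest
    // instead of assuming it lives under the first one.
    public static File findManifest(String relativePath)
    {
        for (String directory : DatabaseDescriptor.getAllDataFileLocations())
        {
            File candidate = new File(directory, relativePath);
            if (candidate.exists())
                return candidate;
        }
        return null; // not found anywhere; caller can start a fresh manifest
    }
}
{code}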

CFS
===
- Not immediately clear to me whether the TODOs in isKeyInRemainingSSTables are
something I should be concerned about.
- Why do we need the reference mark/unmark now but not before?  Is this a bug
fix independent of 1608?
- Are we losing a lot of cycles to markCurrentViewReferenced on the read path
now that this is 1000s of sstables instead of 10s?

DataTracker
===========
- Followed the TODO's suggestion to move incrementallyBackup to another thread.
- Why do we use a LinkedList in buildIntervalTree when we know the size
beforehand?
- Suspect that it's going to be faster to use the interval tree to prune the
search space for CollationController.collectTimeOrderedData, then sort that
subset by timestamp.  That would simplify DataTracker by not having to keep a
list of sstables around sorted by timestamp -- we could get rid of that
entirely in favor of the tree, I think.
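
A sketch of that idea (search() as a point query and getMaxTimestamp() are
assumptions here, not existing methods):
{code}
    // Let the interval tree cut the candidate set down to sstables whose key
    // range actually covers the key, then order just that subset newest-first
    // so collectTimeOrderedData can stop as soon as the query is satisfied.
    private List<SSTableReader> timeOrderedCandidates(IntervalTree tree, DecoratedKey key)
    {
        List<SSTableReader> candidates = tree.search(key); // assumed point query
        Collections.sort(candidates, new Comparator<SSTableReader>()
        {
            public int compare(SSTableReader a, SSTableReader b)
            {
                long ta = a.getMaxTimestamp(); // assumed per-sstable max timestamp
                long tb = b.getMaxTimestamp();
                return tb < ta ? -1 : (tb == ta ? 0 : 1); // newest first
            }
        });
        return candidates;
    }
{code}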

Compaction
==========
- Did this code get moved somewhere else, so that a manual compaction request
against a single sstable remains a no-op for SizeTiered?
{code}
            if (toCompact.size() < 2)
            {
                logger.info("Nothing to compact in " + cfs.getColumnFamilyName() + "." +
                            "Use forceUserDefinedCompaction if you wish to force compaction of single sstables " +
                            "(e.g. for tombstone collection)");
                return 0;
            }
{code}



> Redesigned Compaction
> ---------------------
>
>                 Key: CASSANDRA-1608
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chris Goffinet
>            Assignee: Benjamin Coverston
>         Attachments: 1608-22082011.txt, 1608-v2.txt, 1608-v4.txt
>
>
> After seeing the I/O issues in CASSANDRA-1470, I've been doing some more 
> thinking on this subject that I wanted to lay out.
> I propose we redo the concept of how compaction works in Cassandra. At the 
> moment, compaction is kicked off based on a write access pattern, not read 
> access pattern. In most cases, you want the opposite. You want to be able to 
> track how well each SSTable is performing in the system. If we kept in-memory
> statistics for each SSTable and prioritized them by access frequency and bloom
> filter hit/miss ratios, we could intelligently group the sstables that are
> read most often and schedule them for compaction. We could also schedule
> lower-priority maintenance on SSTables that are not often accessed.
> I also propose we limit each SSTable to a fixed size, which gives us the
> ability to better utilize our bloom filters in a predictable manner. At the
> moment, past a certain size the bloom filters become less reliable. This
> would also allow us to group the most-accessed data. Currently an SSTable
> can grow to a point where large portions of its data might not actually be
> accessed as often.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
