[jira] Commented: (CASSANDRA-847) Make the reading half of compactions memory-efficient

Jonathan Ellis (JIRA) Wed, 10 Mar 2010 21:35:55 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843921#action_12843921
 ]


Jonathan Ellis commented on CASSANDRA-847:
------------------------------------------

I spent some quality time with this patchset on the plane but rather than focus 
on the trees I'd like to back up and look at the forest.  Git sums up my 
misgivings here:
 31 files changed [14 new], 2574 insertions(+), 521 deletions(-)
 
There is too much code churn, and simultaneously too much left to be cleaned up 
or proved later, specifically, that the new structures aren't used on the 
read/write paths or in the new data format that is their real raison d'etre.

Let's reboot this.  Save it off into a git branch, it will be useful, but let's 
start fresh and from a different direction.  I apologize for not coming to this 
conclusion sooner.  I guess I owe you two lunches now. :)

I think the most productive way to move forward on CASSANDRA-16 and company is 
as follows:

 0. Create compaction benchmark (rows compacted per second would be nice)
 1. Change keys to byte[].
 1.5 stress.py benchmark and compaction benchmark to establish baseline.  We 
should pick up some gains from this alone.
 2. Replace ColumnFamily and SuperColumn with ColumnGroup, implementing 
IColumn, and deleting IColumnContainer.  Leave compaction algorithm alone for 
now, and don't introduce Slice yet.
 2.5 stress.py benchmark for informational purposes
 3. Implement new disk format, read + write, but no compaction yet.  I strongly 
suspect that getting this working will greatly clarify how the supporting Slice 
etc structures should look.  Maybe it looks a lot like the current patchset, in 
which case you can think of this as an exercise in providing me with enough 
examples that I can understand it. :)  (And, it will be easy to crib code from 
your branch of this patchset.)  Although I still think ColumnKey should be a 
ColumnKeyRange pairing begin + end...
 3.5 stress.py benchmark, fix performance regressions if any
 
 Note that IMO it is okay to take a minor speed hit in the 2.5 benchmark, as 
long as we gain it back in 3.  But if 3.5 shows that we got way slower then it 
will be good to have 2.5 data to show where the problem(s) are.

 So at this point we can compact old -> old still but not old -> new or new -> 
new.  So the last is to remove the former and add the latter, and I suspect 
that the most natural order is
 
 4. Implement old -> new compaction
 5. Implement new -> new compaction 
 6. Benchmark new -> new compaction, compare with results from 0
 
(Note, I think starting with old -> new is easier since we can "cheat" to 
simplify the problem since we know old can fit in memory.  But maybe it is more 
natural to do new -> new first, or both together.  Doesn't really matter.)
 
I think this gives us a natural progression to where we want to go while adding 
a minimum of new structures at each step and being confident that we haven't 
made changes without having a good idea of the performance implications.

> Make the reading half of compactions memory-efficient
> -----------------------------------------------------
>
>                 Key: CASSANDRA-847
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-847
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Stu Hood
>            Priority: Critical
>             Fix For: 0.7
>
>         Attachments: 
> 0001-Add-structures-that-were-important-to-the-SSTableSca.patch, 
> 0002-Implement-most-of-the-new-SSTableScanner-interface.patch, 
> 0003-Rename-RowIndexedReader-specific-test.patch, 
> 0004-Improve-Scanner-tests-and-separate-SuperCF-handling-.patch, 
> 0005-Add-Scanner-interface-and-a-Filtered-implementation-.patch, 
> 0006-Add-support-for-compaction-of-super-CFs-and-some-tes.patch, 
> 0007-Remove-ColumnKey-bloom-filter-maintenance.patch, 
> 0008-Make-Scanner-extend-Iterator-again.patch, 
> 0009-Make-CompactionIterator-a-ReducingIterator-subclass-.patch, 
> 0010-Alternative-to-ReducingIterator-that-can-return-mult.patch
>
>
> This issue is the next on the road to finally fixing CASSANDRA-16. To make 
> compactions memory efficient, we have to be able to perform the compaction 
> process on the smallest possible chunks that might intersect and contend 
> one-another, meaning that we need a better abstraction for reading from 
> SSTables.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (CASSANDRA-847) Make the reading half of compactions memory-efficient

Reply via email to