[ 
https://issues.apache.org/jira/browse/CASSANDRA-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842118#action_12842118
 ] 

Jonathan Ellis commented on CASSANDRA-847:
------------------------------------------

Some high-level thoughts:

Meta
====

please increase your column width to 120, and put spaces around operators 
(n-1 -> n - 1) :)

Bloom filters
=============

"one huge BF" is still a bad idea.  you're cramming more into a single BF than 
it can usefully handle.  You remember CASSANDRA-790 of course.  Throwing 
columns into the same BF as row keys means that (a) your estimation of how big 
a BF you'll need gets drastically less accurate in the worst case and (b) you 
can either support far fewer rows, or have a much less accurate filter because 
of capacity problems.
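
To put rough numbers on the capacity point, here is a minimal sketch using the 
standard Bloom filter false-positive approximation p = (1 - e^(-k*n/m))^k.  The 
class name, the row count, the columns-per-row figure, and the 10-bits-per-row 
sizing are all assumptions picked for illustration, not values taken from the 
patch or from Cassandra's own BF code.

```java
// Hypothetical illustration only; not Cassandra's BloomFilter implementation.
public class SharedBloomFilterEstimate {
    /** Standard approximation: p = (1 - e^(-k*n/m))^k for m bits, n entries, k hashes. */
    public static double falsePositiveRate(long bits, long entries, int hashes) {
        double exponent = -((double) hashes * entries) / bits;
        return Math.pow(1.0 - Math.exp(exponent), hashes);
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;    // assumed number of row keys
        long columnsPerRow = 10L;  // assumed average columns per row
        long bits = rows * 10L;    // filter sized for row keys only, 10 bits each
        int hashes = 7;            // near-optimal hash count for 10 bits per entry

        // Same filter, two loads: row keys alone vs. row keys plus every column.
        double rowsOnly = falsePositiveRate(bits, rows, hashes);
        double rowsAndColumns = falsePositiveRate(bits, rows + rows * columnsPerRow, hashes);

        System.out.printf("row keys only:      %.4f%n", rowsOnly);
        System.out.printf("row keys + columns: %.4f%n", rowsAndColumns);
    }
}
```

Under these assumptions the row-keys-only filter stays a little under 1% false 
positives, while the shared filter is effectively saturated (above 99%).  Keeping 
the shared filter accurate would mean sizing it for rows * (1 + columns per row) 
entries, which is exactly the estimate that is hard to get right up front.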

furthermore, the more I think about this, the less I think "access column X by 
name that doesn't actually exist" is a frequent operation.  usually if you are 
accessing columns by name the column names are uniform across your rows and 
will exist close to 100% of the time.  and if you are accessing columns by 
slice then a BF is useless.

Put another way, the row key is not just another level of column name and 
deserves special treatment at least in this respect.

[the one exception may be if you are accessing rows whose contents have been 
deleted, but whose tombstones haven't been GC'd.  we should make sure we don't 
actually have a BF entry for a row unless it actually contains data. I don't 
think the current code does this.]

Structures
==========

the Scanner API seems like a step back from IteratingRow to me.  self-contained 
iterators are good.  any time you get more complicated than "here's an object I 
call next() on" things get buggy in my experience.  even more confusing, 
scanners can return IR (but you're not supposed to use it as an iterator?  or 
you are?  not sure).
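
As a sketch of what "here's an object I call next() on" can look like for a 
merge over sorted inputs, something along these lines keeps all cursor state 
inside the iterator.  MergingIterator and its Cursor helper are hypothetical 
names made up for this example; this is not IteratingRow, CompactionIterator, 
or anything from the attached patches, and it merges plain strings rather than 
rows or columns.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Hypothetical illustration only; not Cassandra's IteratingRow or CompactionIterator.
public class MergingIterator implements Iterator<String> {
    // One cursor per sorted source: its current head plus the remaining elements.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> rest) {
            this.rest = rest;
            this.head = rest.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    // Orders the cursors by their current head, smallest first.
    private final PriorityQueue<Cursor> queue = new PriorityQueue<>();

    public MergingIterator(List<Iterator<String>> sortedSources) {
        for (Iterator<String> source : sortedSources) {
            if (source.hasNext()) {
                queue.add(new Cursor(source));
            }
        }
    }

    @Override
    public boolean hasNext() {
        return !queue.isEmpty();
    }

    @Override
    public String next() {
        Cursor smallest = queue.poll();
        if (smallest == null) {
            throw new NoSuchElementException();
        }
        String result = smallest.head;
        // Advance the source we just consumed from and re-queue it if it has more.
        if (smallest.rest.hasNext()) {
            smallest.head = smallest.rest.next();
            queue.add(smallest);
        }
        return result;
    }
}
```

The caller only ever sees hasNext() and next(); there is no separate scanner 
object whose position has to be kept in step with the thing being iterated.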

a telling bad sign: CompactionIterator is 2x as long as it used to be.

I have some thoughts on this, but I am going to save this here; typing long 
things in JIRA is risky. :)

> Make the reading half of compactions memory-efficient
> -----------------------------------------------------
>
>                 Key: CASSANDRA-847
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-847
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Stu Hood
>            Priority: Critical
>             Fix For: 0.7
>
>         Attachments: 
> 0001-Add-structures-that-were-important-to-the-SSTableSca.patch, 
> 0002-Implement-most-of-the-new-SSTableScanner-interface.patch, 
> 0003-Rename-RowIndexedReader-specific-test.patch, 
> 0004-Improve-Scanner-tests-and-separate-SuperCF-handling-.patch, 
> 0005-Add-Scanner-interface-and-a-Filtered-implementation-.patch, 
> 0006-Add-support-for-compaction-of-super-CFs-and-some-tes.patch
>
>
> This issue is the next on the road to finally fixing CASSANDRA-16. To make 
> compactions memory efficient, we have to be able to perform the compaction 
> process on the smallest possible chunks that might intersect and contend 
> with one another, meaning that we need a better abstraction for reading from 
> SSTables.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
