[ 
https://issues.apache.org/jira/browse/CASSANDRA-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591661#comment-14591661
 ] 

Sylvain Lebresne commented on CASSANDRA-8099:
---------------------------------------------

As shown by Branimir's test, range tombstone merging was indeed broken and I've 
pushed the fix for that. I've included the old version test in question (with 
some ugly modification so it tests the reverse case), but I'll look at updating 
it for the more generic version unless you want to have a shot at it.

bq. It would be great to have this clarification in the doc.

Yes, sorry about that. I've tried to clarify and add to the comment in the code 
to start with. I'll update the "guide" to make that clearer when I have a bit 
more time.

{quote}
bq. I've pushed a small semantic-changing suggestion for serialization and 
merging of RT

I hesitated doing this initially and don't remember why I didn't
{quote}

I remember now.

The problem is reverse queries: we need to be able to merge iterator in reverse 
order. And if we re-use a "start" marker as "updates" of the deletion, we won't 
know when reversed and we see a "start" marker if that's a real start or an 
update. Plus we do need both deletion time at a range boundary (the one of the 
range we left for reverse queries, and the one of the range we enter for 
forward ones). In all fairness though, I did had forgotten to actually handle 
the reverse case in the merger, so I've fixed that, but said fix is trivial in 
the current approach.

Now, there is obviously possible alternatives for dealing with RTs, but I feel 
that the current approach has a bunch of good properties:
# it's a simple model: every RT has a begining followed by an end and that's it 
(no overlapping, no inclusions of ranges, very predictible).
# it works in both forward and reverse order, and in an obvious way.
# it makes the purging of gcable RTs trivial (you just blindly collect any 
gcable marker). This is something that was broken by the alternative patch in 
particular and would require some care.
# it reuse slice bounds without adding a new concept, which is nice I think.
So while other options are up for discussions, I think there is enough 
parameters to consider that I'd prefer such potential discussion to happen in a 
separate ticket.

I'll note that the "re-imaging" of markers at the beginning of index blocks is 
actually not necessary and something I forgot to remove. We now store in each 
index block if there is an open marker at the end of that block (primary so 
that we can decide if a sstable don't have any data for a given slice just from 
the index) making this redundant. So I've removed it.

Lastly, I fully agree that {{UnfilteredRowIterators.MergedUnfiltered}} is ugly 
and I meant it as a simple temporary solution: I'm sure we can find cleaner 
alternatives (maybe through some modification of the MergeIterator API so it's 
easy to produce more than one result for a given key).


> Refactor and modernize the storage engine
> -----------------------------------------
>
>                 Key: CASSANDRA-8099
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8099
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 3.0 beta 1
>
>         Attachments: 8099-nit
>
>
> The current storage engine (which for this ticket I'll loosely define as "the 
> code implementing the read/write path") is suffering from old age. One of the 
> main problem is that the only structure it deals with is the cell, which 
> completely ignores the more high level CQL structure that groups cell into 
> (CQL) rows.
> This leads to many inefficiencies, like the fact that during a reads we have 
> to group cells multiple times (to count on replica, then to count on the 
> coordinator, then to produce the CQL resultset) because we forget about the 
> grouping right away each time (so lots of useless cell names comparisons in 
> particular). But outside inefficiencies, having to manually recreate the CQL 
> structure every time we need it for something is hindering new features and 
> makes the code more complex that it should be.
> Said storage engine also has tons of technical debt. To pick an example, the 
> fact that during range queries we update {{SliceQueryFilter.count}} is pretty 
> hacky and error prone. Or the overly complex ways {{AbstractQueryPager}} has 
> to go into to simply "remove the last query result".
> So I want to bite the bullet and modernize this storage engine. I propose to 
> do 2 main things:
> # Make the storage engine more aware of the CQL structure. In practice, 
> instead of having partitions be a simple iterable map of cells, it should be 
> an iterable list of row (each being itself composed of per-column cells, 
> though obviously not exactly the same kind of cell we have today).
> # Make the engine more iterative. What I mean here is that in the read path, 
> we end up reading all cells in memory (we put them in a ColumnFamily object), 
> but there is really no reason to. If instead we were working with iterators 
> all the way through, we could get to a point where we're basically 
> transferring data from disk to the network, and we should be able to reduce 
> GC substantially.
> Please note that such refactor should provide some performance improvements 
> right off the bat but it's not it's primary goal either. It's primary goal is 
> to simplify the storage engine and adds abstraction that are better suited to 
> further optimizations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to