[
https://issues.apache.org/jira/browse/CASSANDRA-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589679#comment-14589679
]
Branimir Lambov commented on CASSANDRA-8099:
--------------------------------------------
The changes look good, I accept that clarifying the side effects and switching
to Iterable is not something that needs to be taken care of as part of this
ticket.
On the range tombstones doc, I'm sorry, I totally failed to explain what I had
in mind. I was looking for a statement in {{### Dealing with tombstones and
shadowed cells}} that says an open marker is immediately followed by the
corresponding close marker. This is a simple and easy to check statement which
is equivalent to having both "does not shadow its own data" (pt 1) and "there
is at most one active tombstone at any point" (pt 4).
However, to clarify this I went and looked further into how tombstones work in
the new code and after looking at it I do not think the merging code could work
correctly. To make certain I wrote a [partition merge
test|https://github.com/apache/cassandra/commit/36709f4f8f81125def076c91ac8dffae2fdf71b0].
The test results are very broken, for example:
{code}
Merging
6<=[34] <8 8 19<=[99] <=36 40 47 50<[56] <66 67 68 72<=[26] <=73 89<=[97] <97
2<=[66] <19 19 34<=[58] <=35 36 40<[94] <41 42<[26] <=48 55<=[35] <=57 58 83 88
5 8 19 31<=[49] <=31 37<=[70] <=44 46<[79] <55 65 72<[57] <85 92<[45] <93 93
results in
2<=[66] <19 19<=[99] <=36 37<=[70] 40<[94] 41<=[70] 41<=[70] 44<=[26] 44<=[26]
46<[79] 55<=[56] 55<=[56] <66 67 68 72<=[26] 72<[57] <85 88 89<=[97] <97
java.lang.AssertionError: 40<[94] follows another open marker 37<=[70]
java.lang.AssertionError: Duplicate marker 41<=[70]
java.lang.AssertionError: Deletion time mismatch for position 44
expected:<deletedAt=70, localDeletion=70> but was:<deletedAt=26,
localDeletion=26>
{code}
{code}
Merging
4 6 13<=[62] <=26 34<=[89] <=34 47 48<[99] <52 54<=[37] <=57 78<[6] <=83 85
88<=[34] <91
0 20 33<=[58] <=33 37<=[84] <40 52 57 77 77 88<[14] <91 92<[15] <96
1 8 31 41<[17] <43 62<=[25] <=67 85 87 88 92 97
results in
0 1 4 6 8 13<=[62] <=26 31 33<=[58] <=33 34<=[89] <=34 37<=[84] <40 41<[17] <43
47 48<[99] <52 52 54<=[37] <=57 62<=[25] <=67 77 77 78<[6] <=83 87 88<=[34] <91
<91 92 92<[15] <96 97
java.lang.AssertionError: <91 should be preceded by open marker, found <91
java.lang.AssertionError: Deletion time mismatch for open 88<=[34] and close
<91 expected:<deletedAt=34, localDeletion=34> but was:<deletedAt=14,
localDeletion=14>
{code}
(where {{x>\[z\]}} means tombstone open marker at {{x}} with time {{z}} and
{{<y}} stands for close marker at {{y}}).
This is also broken with Benedict's implicit close markers, and his patch is
not sufficient to fix it. The test does not even go far enough, as it does not
include rows that are within the span of a tombstone but newer (as far as I can
tell such a row should expand into three Unfiltered's: a close marker, row with
deletion time, and open marker).
Am I somehow completely misunderstanding or missing some assumptions about the
way tombstone ranges are supposed to work? Is there something very wrong in the
way I'm doing this test?
> Refactor and modernize the storage engine
> -----------------------------------------
>
> Key: CASSANDRA-8099
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8099
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Fix For: 3.0 beta 1
>
> Attachments: 8099-nit
>
>
> The current storage engine (which for this ticket I'll loosely define as "the
> code implementing the read/write path") is suffering from old age. One of the
> main problem is that the only structure it deals with is the cell, which
> completely ignores the more high level CQL structure that groups cell into
> (CQL) rows.
> This leads to many inefficiencies, like the fact that during a reads we have
> to group cells multiple times (to count on replica, then to count on the
> coordinator, then to produce the CQL resultset) because we forget about the
> grouping right away each time (so lots of useless cell names comparisons in
> particular). But outside inefficiencies, having to manually recreate the CQL
> structure every time we need it for something is hindering new features and
> makes the code more complex that it should be.
> Said storage engine also has tons of technical debt. To pick an example, the
> fact that during range queries we update {{SliceQueryFilter.count}} is pretty
> hacky and error prone. Or the overly complex ways {{AbstractQueryPager}} has
> to go into to simply "remove the last query result".
> So I want to bite the bullet and modernize this storage engine. I propose to
> do 2 main things:
> # Make the storage engine more aware of the CQL structure. In practice,
> instead of having partitions be a simple iterable map of cells, it should be
> an iterable list of row (each being itself composed of per-column cells,
> though obviously not exactly the same kind of cell we have today).
> # Make the engine more iterative. What I mean here is that in the read path,
> we end up reading all cells in memory (we put them in a ColumnFamily object),
> but there is really no reason to. If instead we were working with iterators
> all the way through, we could get to a point where we're basically
> transferring data from disk to the network, and we should be able to reduce
> GC substantially.
> Please note that such refactor should provide some performance improvements
> right off the bat but it's not it's primary goal either. It's primary goal is
> to simplify the storage engine and adds abstraction that are better suited to
> further optimizations.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)