[ 
https://issues.apache.org/jira/browse/CASSANDRA-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507740#comment-14507740
 ] 

Benedict commented on CASSANDRA-8099:
-------------------------------------

So I'm still trying to digest the entirety of the patch (amongst other things), 
and have been reluctant to give feedback before doing so. Since being fully 
conversant is likely a way off, I figure I should highlight early the biggest 
(labour-wise) concern I have.

Flyweights. They complicate the code, introducing a lot of extra 
implementations of classes. This is both a cognitive burden and an issue for 
the optimizer. As an example, Clustering looks like it could likely get away 
with just a single implementation, or perhaps two, as opposed to the current 
eight. Since these classes are accessed _everywhere_, and often, having 
efficient method despatch is important. Without flyweights, classes like 
Sorting would also be unnecessary, since we could depend on Java sorting, and 
things like the RowDataBlock would not need such complex overloading of 
behaviour to support both rows and collections of rows.
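
To make the "depend on Java sorting" point concrete, here is a minimal sketch, 
not code from the patch. SimpleClustering, its COMPARATOR and 
ClusteringSortSketch are hypothetical names, and ByteBuffer.compareTo is only 
an assumed stand-in for the per-column type comparators. It shows one concrete 
final class (so call sites stay monomorphic) sorted with the standard library 
rather than a custom sorting helper.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical single concrete implementation; NOT the Clustering type from the patch.
final class SimpleClustering {
    private final ByteBuffer[] values;

    SimpleClustering(ByteBuffer... values) { this.values = values; }

    int size() { return values.length; }

    ByteBuffer get(int i) { return values[i]; }

    // Assumption: ByteBuffer.compareTo stands in for real per-column comparators.
    static final Comparator<SimpleClustering> COMPARATOR = (a, b) -> {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            int cmp = a.get(i).compareTo(b.get(i));
            if (cmp != 0) return cmp;
        }
        // Shorter prefix sorts first when all shared components are equal.
        return Integer.compare(a.size(), b.size());
    };

    // Convenience factory for the example: one UTF-8 buffer per component.
    static SimpleClustering of(String... parts) {
        ByteBuffer[] bufs = new ByteBuffer[parts.length];
        for (int i = 0; i < parts.length; i++)
            bufs[i] = ByteBuffer.wrap(parts[i].getBytes(StandardCharsets.UTF_8));
        return new SimpleClustering(bufs);
    }

    // Renders components joined by ':' without disturbing buffer positions.
    String asString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(':');
            sb.append(StandardCharsets.UTF_8.decode(values[i].duplicate()));
        }
        return sb.toString();
    }
}

public class ClusteringSortSketch {
    // Sorts a copy with plain List.sort; no bespoke sorting class is involved.
    static List<String> sortedKeys(List<SimpleClustering> rows) {
        List<SimpleClustering> copy = new ArrayList<>(rows);
        copy.sort(SimpleClustering.COMPARATOR);
        List<String> out = new ArrayList<>();
        for (SimpleClustering c : copy) out.add(c.asString());
        return out;
    }

    public static void main(String[] args) {
        List<SimpleClustering> rows = Arrays.asList(
            SimpleClustering.of("b", "1"),
            SimpleClustering.of("a", "2"),
            SimpleClustering.of("a", "1"));
        System.out.println(sortedKeys(rows));
    }
}
```

Whether one such class is enough for every use of Clustering is exactly the 
open question; the sketch only shows that the sorting side needs nothing 
beyond a Comparator.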

The upside of the flyweights AFAICT is very slim. Only a very tiny amount of 
heap space is saved, and only temporarily, since the vast majority of the data 
(the ByteBuffer objects) is still floating around for every value. I would be 
surprised if we measurably reduced GC burden, and in fact since their lifetimes 
are often longer they may be promoted with higher likelihood. Of course, I may 
be missing some major beneficial case that is important, but either way I 
figure we should start a discussion, sooner rather than later, on whether or not it is 
worth retaining them in this patch. It is possible that the abstraction may be 
useful in future, but I don't think it is right now, and I would prefer to 
introduce it if and when we're confident it will be of use, and to benchmark it 
independently of these other complex changes.


> Refactor and modernize the storage engine
> -----------------------------------------
>
>                 Key: CASSANDRA-8099
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8099
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 3.0
>
>         Attachments: 8099-nit
>
>
> The current storage engine (which for this ticket I'll loosely define as "the 
> code implementing the read/write path") is suffering from old age. One of the 
> main problems is that the only structure it deals with is the cell, which 
> completely ignores the more high level CQL structure that groups cells into 
> (CQL) rows.
> This leads to many inefficiencies, like the fact that during reads we have 
> to group cells multiple times (to count on replica, then to count on the 
> coordinator, then to produce the CQL resultset) because we forget about the 
> grouping right away each time (so lots of useless cell names comparisons in 
> particular). But outside inefficiencies, having to manually recreate the CQL 
> structure every time we need it for something is hindering new features and 
> makes the code more complex than it should be.
> Said storage engine also has tons of technical debt. To pick an example, the 
> fact that during range queries we update {{SliceQueryFilter.count}} is pretty 
> hacky and error prone. Or the overly complex ways {{AbstractQueryPager}} has 
> to go into to simply "remove the last query result".
> So I want to bite the bullet and modernize this storage engine. I propose to 
> do 2 main things:
> # Make the storage engine more aware of the CQL structure. In practice, 
> instead of having partitions be a simple iterable map of cells, it should be 
> an iterable list of rows (each being itself composed of per-column cells, 
> though obviously not exactly the same kind of cell we have today).
> # Make the engine more iterative. What I mean here is that in the read path, 
> we end up reading all cells in memory (we put them in a ColumnFamily object), 
> but there is really no reason to. If instead we were working with iterators 
> all the way through, we could get to a point where we're basically 
> transferring data from disk to the network, and we should be able to reduce 
> GC substantially.
> Please note that such a refactor should provide some performance improvements 
> right off the bat but it's not its primary goal either. Its primary goal is 
> to simplify the storage engine and add abstractions that are better suited to 
> further optimizations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
