MSc Project - compaction strategy

Pedro Gordo Tue, 12 Jul 2016 02:43:49 -0700

Hi all

I'm finishing an MSc, and my final project is to implement a new
compaction strategy in Cassandra. I've discussed the main points of the
strategy with other community members and received valuable feedback.
I understand this will be a tough challenge for someone who has never
worked with Cassandra, but after getting to know the technology, I've
found it fascinating. Since I wanted to contribute to an open source
project as part of my MSc project, Cassandra is the ideal choice.


Since this is my first time contributing to an open source project, I
have some questions on how to proceed correctly. The How To Contribute
<http://wiki.apache.org/cassandra/HowToContribute> page says we're
supposed to create a ticket before starting work. In this case, does
someone need to validate the usefulness of the strategy first, or can I
just proceed and implement it, or should I do something else? Also, is
this the correct mailing list for this sort of question? :)

As for the code itself, if I have a question like "Should we be using an
abstract class for compaction classes?" or "What is this method supposed to
do?", can I ask it here? What is the best way to learn about the details of
the Cassandra code? I've seen that it has some comments, but they probably
won't be enough.

The strategy I have in mind will be very simple until I finish the MSc.
After the submission, I'll improve it with other features and the feedback
I've received, but for the moment I'll keep it at a basic level. The
strategy will run only during certain periods of time (for example, a time
of day when the cluster has little traffic (1)), during which rows will be
made unique across all SSTables. The new SSTables will be capped at a
configurable size, so a single compaction can produce multiple tables. A
row is only compacted if a prior analysis finds that it exists in a number
of SSTables above a certain threshold. What I'm trying to address is the
continuous high CPU usage of LCS (1), but also the need for lots of disk
space when STCS produces big SSTables. I suppose it's a naive strategy, but
the aim is to give me experience with C*, and of course I'll be happy to
take suggestions. I'll probably only use those ideas after delivering the
project, though, because at the moment I need to keep it simple. Otherwise,
I'll never be able to submit it. :)
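
To make the three pieces of the idea concrete (the time window, the
per-row threshold check, and the size cap on output tables), here is a
minimal sketch in plain Java. It does not use any Cassandra API: the class
name, method names, and the in-memory maps standing in for SSTables and row
sizes are all hypothetical, and a real strategy would have to plug into
Cassandra's own compaction classes instead.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Cassandra.
final class WindowedDedupSketch {

    // True if `now` is inside [start, end). Windows that wrap past
    // midnight (e.g. 22:00 to 04:00) are supported. If start equals end,
    // the window is treated as always open.
    static boolean inWindow(LocalTime now, LocalTime start, LocalTime end) {
        if (start.isBefore(end)) {
            return !now.isBefore(start) && now.isBefore(end);
        }
        return !now.isBefore(start) || now.isBefore(end);
    }

    // Prior analysis: return the row keys that appear in strictly more
    // than `threshold` SSTables. Keys are plain strings here for brevity.
    static Set<String> rowsToCompact(Map<String, Set<String>> keysBySSTable,
                                     int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> keys : keysBySSTable.values()) {
            for (String key : keys) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // Group merged rows into output tables of at most `maxBytes` each.
    // A single row larger than the cap still gets a table of its own.
    static List<List<String>> splitBySize(List<String> rows,
                                          Map<String, Long> rowSizes,
                                          long maxBytes) {
        List<List<String>> tables = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long used = 0;
        for (String row : rows) {
            long size = rowSizes.get(row);
            if (!current.isEmpty() && used + size > maxBytes) {
                tables.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(row);
            used += size;
        }
        if (!current.isEmpty()) {
            tables.add(current);
        }
        return tables;
    }
}
```

In this sketch, a compaction round would first check inWindow, then call
rowsToCompact to pick the candidate rows, and finally pass the merged rows
to splitBySize to decide how many capped tables to write.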

Sorry for the long email, and thanks for all the help in advance! I'm very
excited about this project and look forward to being part of this community!

Best regards,
Pedro Gordo
