Hi all,

I'm finishing an MSc, and my final project is to implement a new compaction strategy in Cassandra. I've discussed the main points of the strategy with other community members and received valuable feedback. I understand this will be a tough challenge for someone who has never worked with Cassandra, but the more I learn about the technology, the more fascinating I find it. Since I wanted to contribute to an open source project as part of my MSc project, Cassandra was the ideal choice.

Since this is my first time contributing to an open source project, I have some questions on how to proceed correctly. Looking at the How To Contribute <http://wiki.apache.org/cassandra/HowToContribute> page, I see that we're supposed to create a ticket before starting work. In this case, does someone need to validate the usefulness of the strategy first, can I just go ahead and implement it, or should I do something else? Also, is this the correct mailing list for these sorts of questions? :)

As for the code itself, if I have a question like "Should we be using an abstract class for compaction classes?" or "What is this method supposed to do?", can I ask it here? What is the best way to learn about the details of the Cassandra code? I've seen that it has some comments, but they probably won't be enough.

The strategy I have in mind will be very simple until I finish the MSc. After the submission, I'll improve it with other features and the feedback I've received, but for the moment I'll keep it at a basic level. The strategy will only run during certain periods of time (for example, a time of day when the cluster has little traffic (1)), during which rows will be made unique across all SSTables. The new SSTables will be capped at a configurable size, so a single compaction can produce multiple SSTables. A row is only compacted if a prior analysis finds that it exists in a number of SSTables above a certain threshold.

What I'm trying to address here is the continuous high CPU usage of LCS (1), but also the need for lots of disk space when big SSTables result from STCS. I suppose it's a naive strategy, but the aim is to give me experience with C*, and of course I'll be happy to take suggestions. I'll probably only use those ideas after delivering the project, though, because at the moment I need to keep it simple. Otherwise, I'll never be able to submit it.

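To make the idea a bit more concrete, here is a rough sketch in plain Java of the three pieces described above: the time window check, the prior analysis that counts how many SSTables contain each row, and the size cap on the output. It is not Cassandra code. Each SSTable is modelled as a set of row keys, the cap is expressed in rows rather than bytes, and every name (OffPeakCompactionSketch, inWindow, keysToCompact, splitIntoTables) is hypothetical.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: plain collections stand in for SSTables, and none of
// these names exist in Cassandra.
class OffPeakCompactionSketch {

    // True if 'now' is inside [start, end). Also handles windows that wrap
    // past midnight, e.g. 23:00 to 04:00.
    static boolean inWindow(LocalTime now, LocalTime start, LocalTime end) {
        if (start.isBefore(end)) {
            return !now.isBefore(start) && now.isBefore(end);
        }
        return !now.isBefore(start) || now.isBefore(end);
    }

    // Prior analysis: count how many SSTables contain each row key and keep
    // only the keys found in more than 'threshold' SSTables.
    static Set<String> keysToCompact(List<Set<String>> sstables, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> sstable : sstables) {
            for (String key : sstable) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        Set<String> selected = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > threshold) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    // Split the merged rows into output tables of at most maxRowsPerTable
    // rows each (a simplification of a byte-size cap).
    static List<List<String>> splitIntoTables(Set<String> keys, int maxRowsPerTable) {
        if (maxRowsPerTable < 1) {
            throw new IllegalArgumentException("maxRowsPerTable must be >= 1");
        }
        List<List<String>> tables = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String key : keys) {
            if (current.size() == maxRowsPerTable) {
                tables.add(current);
                current = new ArrayList<>();
            }
            current.add(key);
        }
        if (!current.isEmpty()) {
            tables.add(current);
        }
        return tables;
    }

    public static void main(String[] args) {
        List<Set<String>> sstables = List.of(
                Set.of("a", "b"), Set.of("a", "c"), Set.of("a", "b", "d"));
        // Only run inside the configured off-peak window.
        if (inWindow(LocalTime.of(2, 0), LocalTime.of(23, 0), LocalTime.of(4, 0))) {
            Set<String> keys = keysToCompact(sstables, 1);
            System.out.println(splitIntoTables(keys, 1));
        }
    }
}
```

A real strategy would of course work on Cassandra's own SSTable and partition abstractions and would merge the row contents, not just the keys. The sketch is only meant to show the order of the decisions.
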
Sorry for the long email, and thanks in advance for all the help! :) I'm very excited about this project and look forward to being part of this community!

Best regards,
Pedro Gordo