Hello everyone,
For an interview task I've been asked to make the CommitLog multithreaded. I'm
new to the Cassandra project, so before I start modifying code I want to
make sure I correctly understand what is going on there.
Feel free to correct anything I got wrong or partially wrong.
1. The CommitLog singleton object is responsible for receiving
RowMutation objects through its add method. The add method is thread-safe
and is meant to be called by many threads adding their RowMutations
independently.
2. Each invocation of CommitLog#add puts a new task onto the queue.
This task is represented by a LogRecordAdder callable object, which is
responsible for actually calling the CommitLogSegment#write method to do
all the "hard work": serializing the RowMutation object, calculating the
CRC, and writing the result into the memory-mapped CommitLogSegment
file buffer. The add method immediately returns a Future object, which
can be waited on (if needed) - it will block until the row mutation has
been saved to the log file and (optionally) synced.
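If I read this right, the pattern is roughly the following. A minimal sketch
with a plain single-threaded executor standing in for
ICommitLogExecutorService; the class and method names here (CommitLogSketch,
write, contents) are my placeholders, not Cassandra's:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: a single-threaded executor serializes all log writes,
// while add() stays thread-safe and returns a Future to wait on.
public class CommitLogSketch {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final StringBuilder segment = new StringBuilder(); // stand-in for the mapped buffer

    // roughly what CommitLog#add does: enqueue the write, return a Future
    public Future<?> add(String mutation) {
        return executor.submit(() -> write(mutation));
    }

    // stand-in for CommitLogSegment#write (serialize + CRC + append)
    private void write(String mutation) {
        segment.append(mutation).append('\n');
    }

    public String contents() { return segment.toString(); }

    public void shutdown() { executor.shutdown(); }

    public static void main(String[] args) throws Exception {
        CommitLogSketch log = new CommitLogSketch();
        Future<?> f = log.add("row1");
        f.get(); // the caller may block here until the write completes
        System.out.println(log.contents());
        log.shutdown();
    }
}
```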
3. The queued tasks are processed one by one, sequentially, by the
appropriate ICommitLogExecutorService. This service also controls
syncing of the active memory-mapped segments. There are two sync
strategies available: periodic and batched. The periodic strategy simply
calls sync periodically, by asynchronously putting an appropriate sync
task into the queue in between the LogRecordAdder tasks. The
LogRecordAdder tasks are marked "done" as soon as they are written to
the log, so the caller *won't wait* for the sync. The batched strategy
(BatchCommitLogExecutorService), on the other hand, performs the tasks
in batches, each batch finished with a sync operation. The tasks are
marked as done *after* the sync operation has finished. This deferred
task completion is achieved thanks to the CheaterFutureTask class, which
allows running the task without immediately marking the FutureTask as
done. Nice. :)
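My reading of the CheaterFutureTask trick, as a sketch (this is not the real
class; runWithoutCompleting and complete are names I made up): do the work
right away, but only complete the Future - and unblock any waiters - after
an explicit signal, e.g. once the batch sync is done.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;

// Sketch of deferred completion: FutureTask#set is protected, so a
// subclass can delay marking the future done until after the sync.
public class DeferredFutureTask<T> extends FutureTask<T> {
    private final Callable<T> work;
    private T result;

    public DeferredFutureTask(Callable<T> work) {
        super(work);
        this.work = work;
    }

    // perform the work now, without marking the future as done
    public void runWithoutCompleting() throws Exception {
        result = work.call();
    }

    // called after the batch sync: now mark the future done,
    // which wakes up anyone blocked in get()
    public void complete() {
        set(result);
    }
}
```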
4. The serialized size of the RowMutation object is calculated twice:
once before submitting to the ExecutorService - to check that it is not
larger than the segment size - and again after the task is taken from
the queue for execution - to check whether it fits into the active
CommitLogSegment and, if it doesn't, to activate a new
CommitLogSegment. This looks to me like a point needing optimisation:
I couldn't find any code for caching the serialized size to avoid
computing it twice.
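A trivial sketch of the caching I have in mind: compute the size once in
add() and carry it along with the task (SizedWrite and its members are my
invention, not existing code):

```java
// Hypothetical task payload that caches the serialized size, so the
// executor side does not have to measure the mutation a second time.
final class SizedWrite {
    final byte[] serialized;   // stand-in for the serialized RowMutation
    final int serializedSize;  // computed exactly once, before enqueueing

    SizedWrite(byte[] serialized) {
        this.serialized = serialized;
        this.serializedSize = serialized.length;
    }

    // the executor-side fits-in-segment check can reuse the cached size
    boolean fitsIn(int remainingSegmentSpace) {
        return serializedSize <= remainingSegmentSpace;
    }
}
```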
5. The serialization, CRC calculation and actual commit log writes are
happening sequentially. The aim of this ticket is to make it parallel.
Questions:
1. What happens during recovery if the power goes off before the log
has been synced and it has been written only partially (e.g. it is
truncated in the middle of the RowMutation data)? Are incomplete
RowMutation writes detected only by means of the CRC (CommitLog, around
lines 237-240), or is there some other mechanism for it?
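For the record, this is the kind of check I assume recovery performs (the
class and method name are mine): recompute the checksum over the record
bytes read from the log and compare it with the stored value; a truncated
record should fail it.

```java
import java.util.zip.CRC32;

// Sketch of CRC-based detection of incomplete records during replay.
public class CrcCheck {
    static boolean recordLooksComplete(byte[] record, long storedCrc) {
        CRC32 crc = new CRC32();
        crc.update(record, 0, record.length);
        return crc.getValue() == storedCrc;
    }
}
```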
2. Is the CommitLog#add method allowed to do some heavier computation?
What is its contract? Does it have to return immediately, or can I
move some code into it?
Solutions I consider (please comment):
1. Moving the serialized size calculation, serialization and CRC
calculation entirely before the executor service queue, so that these
operations would run in parallel and be performed once per RowMutation
object. The calculated size / data array / CRC value would be attached
to the task and put into the queue. Copying that into the commit log
would then proceed sequentially - the task would contain only the code
for the log write. This is the safest and easiest solution, but also
the least performant, because the copying is still sequential and might
remain a bottleneck. The logic for allocating new commit log segments
and syncing remains unchanged.
2. Moving the serialized size calculation, serialization, CRC
calculation *and the commit log write* before the executor service
queue. This immediately raises some problems / questions:
a) The code for segment allocation needs to be changed, as it becomes
multithreaded. It can be done with AtomicInteger.compareAndSet, so that
each RowMutation gets its own non-overlapping piece of the commit log
to write into.
b) What happens if there is not enough free space in the current active
segment? Do we allow more than one active segment at a time? Or do we
restrict the parallelism to writing into a single active segment (which
I don't like, as it would certainly be less performant - we would have
to wait for the current active segment to be finished before we could
start a new one)?
c) Is the recovery method prepared to read a partially written
(invalid) RowMutation that is not the last mutation in the commit log?
If we allow writing several row mutations in parallel, it has to be.
d) The tasks would be sent to the queue only for the wait-for-sync
functionality - they would not contain any code to execute, because
everything would already have been done.
3. Everything just as in 2., but with the addition that the
serialization code writes directly into the target memory-mapped buffer
and not into a temporary byte array. This would save us the copying and
also put less strain on the GC.
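What I mean by "writing directly", sketched with a heap ByteBuffer standing
in for the memory-mapped segment (regionFor is my helper, not an existing
API): each writer gets an independent slice of the shared buffer, so
serialization and the CRC both operate on the target region, with no
temporary arrays.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Sketch: carve a private window out of the shared segment buffer.
// duplicate() gives an independent position/limit, so concurrent writers
// don't disturb each other; slice() narrows the view to the reserved region.
public class DirectWrite {
    static ByteBuffer regionFor(ByteBuffer segmentBuffer, int offset, int size) {
        ByteBuffer dup = segmentBuffer.duplicate();
        dup.position(offset);
        dup.limit(offset + size);
        return dup.slice();
    }

    public static void main(String[] args) {
        ByteBuffer segment = ByteBuffer.allocate(1024); // stand-in for the mapped file
        ByteBuffer region = regionFor(segment, 100, 8);
        region.putLong(0x1122334455667788L); // "serialize" straight into place

        CRC32 crc = new CRC32();
        region.flip();
        crc.update(region); // CRC over the same region, no temporary copy
        System.out.println(Long.toHexString(crc.getValue()));
    }
}
```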
Sorry for such a long e-mail, and best regards,
Piotr Kolaczkowski
--
Piotr Kołaczkowski
Instytut Informatyki, Politechnika Warszawska
Nowowiejska 15/19, 00-665 Warszawa
e-mail: pkola...@ii.pw.edu.pl
www: http://home.elka.pw.edu.pl/~pkolaczk/