I am starting some work on an input format that would let us read SSTables stored in HDFS, and I wonder if anyone has worked on something similar before. I did come across http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html, but it isn't open sourced/available yet. I am writing for a sanity check before I go too deep into this, and I have a few questions that I hope someone here can help with.

So far, I have been able to read SSTables stored on the local file system using the SSTableScanner and the SSTableReader. I am wondering what a good way to proceed would be: would a custom implementation of RandomAccessFile (like the RandomAccessReader and the CompressedRandomAccessReader) that uses Hadoop's FileSystem API be a reasonable approach?

I searched but may have missed it: is there any documentation on the binary format of the data, index, and stats files? That would make it simpler for me to prototype without having to go through the Cassandra internals.

I am currently working off our production deployment, which is on 1.1.0. Any guidance you can give would be much appreciated (I am new to Cassandra internals).

Many thanks,
Amit
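For context, here is a rough sketch of what I have in mind for the Hadoop-backed reader. HdfsSeekableReader is a hypothetical name, and the method set is only my guess at the seek/read/length subset Cassandra's readers rely on; the Hadoop calls themselves (FileSystem.open, FSDataInputStream.seek/getPos) are standard:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of a seekable reader over Hadoop's FileSystem API -- the kind of
 * primitive a RandomAccessReader replacement for HDFS might wrap.
 * (Hypothetical class; not part of Cassandra or Hadoop.)
 */
public class HdfsSeekableReader implements AutoCloseable {
    private final FSDataInputStream in;
    private final long length;

    public HdfsSeekableReader(FileSystem fs, Path path) throws IOException {
        // FSDataInputStream is seekable, unlike a plain InputStream,
        // which is what makes random access into an SSTable possible.
        this.in = fs.open(path);
        this.length = fs.getFileStatus(path).getLen();
    }

    public void seek(long pos) throws IOException { in.seek(pos); }

    public long getFilePointer() throws IOException { return in.getPos(); }

    public long length() { return length; }

    public int read(byte[] buf, int off, int len) throws IOException {
        return in.read(buf, off, len);
    }

    @Override
    public void close() throws IOException { in.close(); }
}
```

The same code works against a local path via FileSystem.getLocal(conf) (the file:// scheme), which should make it easy to unit test before pointing it at an hdfs:// URI.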