We don't want to set up a parallel workflow for analytics; we already use Hadoop for that. It would be trivial to periodically copy the newly created SSTables into HDFS and then have mappers read the SSTables in parallel. Going through Thrift is an option, but an inefficient one, and one that impacts the production Cassandra cluster.
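Roughly what I have in mind for the copy step is a small job against Hadoop's FileSystem API. This is only a sketch; the local data directory, the HDFS target path, and the SSTableShipper class name are made up for illustration:

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SSTableShipper {
        public static void main(String[] args) throws Exception {
            FileSystem hdfs = FileSystem.get(new Configuration());

            // Hypothetical local Cassandra data dir and HDFS target.
            File dataDir = new File("/var/lib/cassandra/data/MyKeyspace/MyCF");
            Path target = new Path("/analytics/sstables/MyKeyspace/MyCF");
            hdfs.mkdirs(target);

            for (File f : dataDir.listFiles()) {
                // Skip in-flight compaction temp files; copy each
                // component (Data, Index, Statistics, ...) only once.
                if (f.getName().contains("-tmp-")) continue;
                Path dst = new Path(target, f.getName());
                if (!hdfs.exists(dst)) {
                    hdfs.copyFromLocalFile(new Path(f.getAbsolutePath()), dst);
                }
            }
        }
    }

Something like this could run from cron after each flush/compaction cycle; since SSTables are immutable once written, a copy-if-absent check is enough to keep the HDFS side current.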
Amit

On Sat, Mar 23, 2013 at 2:40 PM, Michael Kjellman
<mkjell...@barracuda.com> wrote:
> Just curious, why would you want to store sstables in HDFS?
>
> On 3/23/13 12:43 PM, "Amit Kumar" <kumarami...@gmail.com> wrote:
>
>>I am starting some work on an InputFormat that would let us read
>>SSTables stored in HDFS, and I wonder if anyone has worked on
>>something similar before. I did come across
>>
>>http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
>>
>>However, it is not open sourced/available yet.
>>
>>I am writing for a sanity check before I go too deep into this.
>>
>>I have a few questions; hoping someone here will be able to help.
>>
>>So far, I have been able to read SSTables stored on the local file
>>system using the SSTableScanner and the SSTableReader. I am wondering
>>what would be a good way to proceed: perhaps a custom implementation
>>of RandomAccessFile (like the RandomAccessReader and the
>>CompressedRandomAccessReader) that would use Hadoop's FileSystem
>>API?
>>
>>I did search for, but may have missed, documentation on the binary
>>format of the Data, Index, and Statistics files. That would make it
>>simpler to prototype without having to go through the Cassandra
>>internals. I am currently working off our production deployment,
>>which is on 1.1.0.
>>
>>Any guidance you can offer would be appreciated (I am new to
>>Cassandra internals).
>>
>>Many thanks
>>Amit
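PS: on the reader question above, the direction I am leaning toward is a thin wrapper over Hadoop's FSDataInputStream, which is already seekable. A sketch of the idea only; the HdfsRandomAccessReader class and its method names are illustrative, not Cassandra's actual RandomAccessReader contract:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRandomAccessReader {
        private final FSDataInputStream in;
        private final long length;

        public HdfsRandomAccessReader(FileSystem fs, Path path) throws IOException {
            this.in = fs.open(path);                       // seekable HDFS stream
            this.length = fs.getFileStatus(path).getLen();
        }

        // The handful of operations the SSTable readers seem to need:
        // absolute seek, current position, file length, and bulk reads.
        public void seek(long pos) throws IOException { in.seek(pos); }
        public long getFilePointer() throws IOException { return in.getPos(); }
        public long length() { return length; }
        public int read(byte[] b, int off, int len) throws IOException {
            return in.read(b, off, len);
        }
        public void close() throws IOException { in.close(); }
    }

The open question for me is the compressed case, where CompressedRandomAccessReader would need the chunk offsets from the CompressionInfo component before it can seek.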