Hi,

I managed to modify the Hadoop-Cassandra integration so that a job starts from a column of a CF used for indexing. In the map phase, I get keys from different CFs and fetch the rows I need. This all works fine for a single node. :)
Now I'd like to identify the set of nodes that hold a given set of rows, so that the rows can be read into Hadoop efficiently. My initial design is to add a new operation to the Thrift interface, along these lines:

    Map<(CF+key), List<endpoints>> client.get_endpoints(List<CF+keys>)

The functionality would be similar to nodetool's getEndpoints. While processing, we could then look up the endpoints for each CF and key in the returned map, without asking a node for the endpoints of each and every key. Only if the key is not found at the endpoint (because a node was added or moved while the job was running) would we calculate the relevant endpoint again.

I'd like to ask the Cassandra devs whether this sounds like the best way to do it, and to point out any improvements or flaws in my approach.

Thanks in advance.

--
Regards,
Tharindu
blog: http://mackiemathew.com/
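
P.S. To make the idea concrete, here is a rough Java sketch of the client side. It is not a working patch. EndpointLookup stands in for the proposed get_endpoints call, which does not exist in the Thrift interface today. The EndpointCache class, the "CF:key" string encoding, and endpoints as plain address strings are all placeholders I made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side cache around the proposed get_endpoints call. */
class EndpointCache {

    /** Stand-in for the proposed Thrift operation; NOT an existing Cassandra API. */
    interface EndpointLookup {
        /** Keys are assumed to be encoded as "CF:key"; values are endpoint addresses. */
        Map<String, List<String>> getEndpoints(List<String> cfAndKeys);
    }

    private final EndpointLookup lookup;
    private final Map<String, List<String>> cache = new HashMap<>();

    EndpointCache(EndpointLookup lookup) {
        this.lookup = lookup;
    }

    /** Resolve the endpoints for a whole batch of keys in one round trip. */
    void prefetch(List<String> cfAndKeys) {
        cache.putAll(lookup.getEndpoints(cfAndKeys));
    }

    /** Return the cached endpoints, asking the cluster only if the key is unknown. */
    List<String> endpointsFor(String cfAndKey) {
        List<String> cached = cache.get(cfAndKey);
        return cached != null ? cached : refresh(cfAndKey);
    }

    /**
     * Recalculate the endpoints for a single key. Meant to be called when a
     * read at the cached endpoint misses, e.g. after a node was added or moved.
     */
    List<String> refresh(String cfAndKey) {
        List<String> fresh =
            lookup.getEndpoints(Collections.singletonList(cfAndKey)).get(cfAndKey);
        if (fresh == null) {
            fresh = Collections.emptyList(); // the lookup knows no endpoint for this key
        }
        cache.put(cfAndKey, fresh);
        return fresh;
    }
}
```

The job would call prefetch once per batch of keys, use endpointsFor while processing, and call refresh only for a key whose row was not found at the cached endpoint.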