Efficient way of figuring out which nodes a set of keys belong to - Hadoop integration

Tharindu Mathew Thu, 22 Sep 2011 11:04:29 -0700

Hi,

I managed to modify the Hadoop-Cassandra integration so that a job starts
from a column of a CF used for indexing. In the map phase, I take keys that
belong to different CFs and fetch the rows I need. This all works fine on a
single node. :)


Now I'd like to identify the set of nodes that hold a given set of rows, so
I can pull those rows into Hadoop efficiently. My initial design is
something like this.

Add a new operation to the Thrift interface that allows us to do:

Map<(CF+key), List<endpoints>> client.get_endpoints ( List<CF+keys>)

The functionality would be similar to nodetool's getendpoints.

Then, when processing, we can look up the endpoints for each CF and key
from this map, without querying a node for each and every key. Only if a
key is not found on its endpoint (because a node was added or moved while
the job was running) do we calculate the relevant endpoint again.
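
To make the idea a bit more concrete, here is a rough sketch of how the
map returned by the proposed call could be used on the Hadoop side to batch
keys per node. The class and method names (EndpointGrouping,
groupByFirstEndpoint) and the UNRESOLVED marker are made up for
illustration and are not part of Cassandra. It also assumes the first
endpoint in each list is the one we want to read from.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Cassandra
// or its Hadoop integration.
final class EndpointGrouping {

    // Marker bucket for keys whose endpoints are unknown and must be
    // looked up again (assumption: callers handle this bucket separately).
    static final String UNRESOLVED = "";

    private EndpointGrouping() {}

    /**
     * Inverts a (CF+key) -> endpoints map into an endpoint -> (CF+key) map,
     * assigning each key to the first endpoint in its list.
     */
    static Map<String, List<String>> groupByFirstEndpoint(
            Map<String, List<String>> endpointsByKey) {
        Map<String, List<String>> keysByEndpoint = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : endpointsByKey.entrySet()) {
            List<String> replicas = entry.getValue();
            String endpoint = (replicas == null || replicas.isEmpty())
                    ? UNRESOLVED
                    : replicas.get(0);
            keysByEndpoint
                    .computeIfAbsent(endpoint, e -> new ArrayList<>())
                    .add(entry.getKey());
        }
        return keysByEndpoint;
    }
}
```

Each bucket could then be fetched from its own node in one batch, and
anything left in the UNRESOLVED bucket would go through the lookup again.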

I'd like to ask the Cassandra devs whether this sounds like the best way to
do this, and to point out any improvements to, or flaws in, my approach.

Thanks in advance.

-- 
Regards,

Tharindu

blog: http://mackiemathew.com/
