Hello,

I'm looking through the source code on GitHub for the Couchbase Hadoop 
connector. If I'm understanding correctly, the code that generates the 
splits takes all of the possible vbuckets and breaks them up into groups 
based on the expected number of mappers set by Sqoop. This means that no 
matter what, even if a mapper is scheduled on a Couchbase node, the reads 
from the dump are ALWAYS sent over the network instead of possibly being 
pulled from the local node's memory and funneled straight into the mapper 
sitting on that node.
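To make the concern concrete, here's a minimal sketch of what I understand the current split generation to do (this is my simplification for illustration, not the connector's actual code): vbucket IDs are simply dealt round-robin into one group per expected mapper, with no regard to which node hosts each vbucket.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSplits {
    // Hypothetical simplification of the connector's getSplits():
    // partition vbucket IDs purely by the requested mapper count.
    static List<List<Integer>> splits(int numVBuckets, int numMappers) {
        List<List<Integer>> groups = new ArrayList<>();
        for (int i = 0; i < numMappers; i++) {
            groups.add(new ArrayList<>());
        }
        for (int vb = 0; vb < numVBuckets; vb++) {
            // Round-robin assignment; node locality is never consulted.
            groups.get(vb % numMappers).add(vb);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<Integer>> groups = splits(1024, 4);
        System.out.println(groups.size());        // 4 splits
        System.out.println(groups.get(0).size()); // 256 vbuckets each
    }
}
```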

Looking further into the code in the Java Couchbase client, I see a class 
called "VBucketNodeLocator" which has a method getServerByIndex(int k). If 
I understand this method correctly, it lets me look up the server that 
holds vbucket number k. Is that right? If it is, would it make sense to 
use it in the getSplits() method of CouchbaseInputFormat, so that the 
splits can be grouped by the server on which their vbuckets live? I agree 
it may not matter much to those who run their Couchbase cluster separately 
from their Hadoop cluster, but it's a SIGNIFICANT optimization for those 
who have the two co-located.
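Here's a rough sketch of what I'm proposing, assuming getServerByIndex() behaves as I described. The lookupServer method below is a stand-in for VBucketNodeLocator.getServerByIndex(int) with a fake two-node topology, purely for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocalityAwareSplits {
    // Stand-in for VBucketNodeLocator.getServerByIndex(int k).
    // Fake topology: vbuckets striped across two nodes.
    static String lookupServer(int vbucket) {
        return (vbucket % 2 == 0) ? "node-a:8091" : "node-b:8091";
    }

    // Group vbucket IDs by the server that hosts them, so each
    // resulting split could be scheduled on (or near) that node.
    static Map<String, List<Integer>> splitsByServer(int numVBuckets) {
        Map<String, List<Integer>> byServer = new LinkedHashMap<>();
        for (int vb = 0; vb < numVBuckets; vb++) {
            byServer.computeIfAbsent(lookupServer(vb),
                    s -> new ArrayList<>()).add(vb);
        }
        return byServer;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> splits = splitsByServer(1024);
        System.out.println(splits.keySet());
        System.out.println(splits.get("node-a:8091").size());
    }
}
```

Each group's server name could then be reported as the preferred location of its InputSplit (via Hadoop's InputSplit.getLocations()), so the scheduler can place the mapper on the co-located node when one exists.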

Any thoughts?


Thanks!

-- 
You received this message because you are subscribed to the Google Groups 
"Couchbase" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
