Hi Corey,

From: Corey Nolet <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, March 12, 2014 8:57 PM
To: "[email protected]" <[email protected]>
Subject: Couchbase Sqoop Data Locality question

Hello,

I'm looking through the source code on GitHub for the Couchbase Hadoop 
connector. If I'm understanding correctly, the code that generates the splits 
takes all the possible vbuckets and breaks them up into groups based on the 
expected number of mappers set by Sqoop. This means that no matter what, even 
if a mapper is scheduled on a Couchbase node, the reads from the dump are 
ALWAYS sent over the network instead of possibly being pulled from the local 
node's memory and funneled straight into the mapper sitting on that node.
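
To make sure I'm reading it right, here's a rough sketch of the behavior I 
think I'm seeing (hypothetical code and class names, not the connector's 
actual CouchbaseInputFormat.getSplits() implementation):

    // Hypothetical sketch of the split behavior described above; this is not
    // the connector's actual getSplits() code, just an illustration.
    import java.util.ArrayList;
    import java.util.List;

    public class NaiveVBucketSplitter {

        // Break vbucket ids 0..numVBuckets-1 into roughly equal contiguous
        // groups, one per expected mapper, ignoring which Couchbase node owns
        // each vbucket.
        public static List<List<Integer>> splitByMapperCount(int numVBuckets, int numMappers) {
            int perSplit = (int) Math.ceil((double) numVBuckets / numMappers);
            List<List<Integer>> splits = new ArrayList<>();
            List<Integer> current = new ArrayList<>();
            for (int vb = 0; vb < numVBuckets; vb++) {
                current.add(vb);
                if (current.size() == perSplit) {
                    splits.add(current);
                    current = new ArrayList<>();
                }
            }
            if (!current.isEmpty()) {
                splits.add(current);
            }
            return splits;
        }

        public static void main(String[] args) {
            // 1024 vbuckets across 4 mappers -> 4 splits of 256 vbuckets each,
            // so a mapper co-located with a Couchbase node still reads most of
            // its vbuckets over the network.
            System.out.println(splitByMapperCount(1024, 4).size());
        }
    }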

This is true, however...

Hadoop is designed to run across a cluster of systems distributed on a network. 
Couchbase, similarly, is designed to run distributed across systems. So while 
you're describing a possible optimization, it's not something that's part of 
either one right now.

Actually, the way Sqoop runs is ideal for most deployments, since it gives us 
great throughput by splitting and distributing the job.


Looking further into the code in the Java Couchbase client, I'm seeing a class 
called "VBucketNodeLocator" which has a method getServerByIndex(int k). If I 
understand this method correctly, it allows me to look up the server that holds 
vbucket number k. Is this right? If it is, would it make sense for this to be 
used in the getSplits() method in the CouchbaseInputFormat so that the splits 
for the vbuckets can be grouped by the server on which they live? I agree that 
it may not make sense for many who have their Couchbase cluster separate from 
their Hadoop cluster... but it's a SIGNIFICANT optimization for those who have 
the two co-located.
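
Roughly what I have in mind is the sketch below. The IntFunction lookup is 
just a hypothetical stand-in for something like getServerByIndex(int); this 
isn't the connector's code, only an illustration of grouping splits by server:

    // Hypothetical sketch of a locality-aware getSplits(): group vbuckets by
    // the server that owns them. The IntFunction lookup stands in for something
    // like VBucketNodeLocator.getServerByIndex(int); illustration only.
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.IntFunction;

    public class LocalityAwareVBucketSplitter {

        // Group vbucket ids by owning server so each split can be scheduled on
        // (or near) the node that already holds its data.
        public static Map<String, List<Integer>> splitByServer(
                int numVBuckets, IntFunction<String> serverForVBucket) {
            Map<String, List<Integer>> splits = new LinkedHashMap<>();
            for (int vb = 0; vb < numVBuckets; vb++) {
                String server = serverForVBucket.apply(vb);
                splits.computeIfAbsent(server, s -> new ArrayList<>()).add(vb);
            }
            return splits;
        }

        public static void main(String[] args) {
            // Pretend cluster of 4 nodes with vbuckets assigned round-robin.
            String[] nodes = {"cb-node-1", "cb-node-2", "cb-node-3", "cb-node-4"};
            splitByServer(1024, vb -> nodes[vb % nodes.length])
                    .forEach((node, vbs) ->
                            System.out.println(node + " -> " + vbs.size() + " vbuckets"));
        }
    }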

Yes, enhancing how the splitting is done by using that method (which isn't 
really considered public) would be an optimization. If you want to try building 
that optimization, the repo is here:
https://github.com/couchbase/couchbase-hadoop-plugin

Our code review is here:
http://review.couchbase.org/#/q/status:open+project:couchbase-hadoop-plugin,n,z

Soon we'll be doing some updates there too.

Thanks,

Matt

--
Matt Ingenthron
Couchbase, Inc.
