There's a balancer available to rebalance block data across the DataNodes (DNs) of an HDFS cluster in general. It is available in the $HADOOP_HOME/bin/ directory as start-balancer.sh.
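
As a rough sketch of how that script might be invoked (the fallback install path and the threshold value here are only assumptions for illustration; -threshold is the balancer's percentage-of-disk-utilization option, with 10 being its usual default):

```shell
# Assumed fallback install location; adjust to your own HADOOP_HOME.
HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"
BALANCER="$HADOOP_HOME/bin/start-balancer.sh"

# Allowed deviation (in percent) of each DataNode's utilization
# from the cluster-wide average before the balancer moves blocks.
THRESHOLD=10

if [ -x "$BALANCER" ]; then
  # Starts the balancer as a background daemon on the cluster.
  "$BALANCER" -threshold "$THRESHOLD"
else
  echo "balancer script not found at $BALANCER" >&2
fi
```

A lower threshold gives a more even cluster but makes the balancer run longer; stop-balancer.sh in the same directory stops it.
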
But what I think the Sqoop documentation implies is that your data ends up balanced because of the map tasks it runs for imports (the work is divided between the maps using a provided split factor), which should make it write chunks of data out to different DataNodes. I guess you could get more information on the Sqoop mailing list, [email protected]: https://groups.google.com/a/cloudera.org/group/sqoop-user/topics

On Thu, Mar 17, 2011 at 5:04 AM, BeThere <[email protected]> wrote:
> The sqoop documentation seems to imply that it uses the key information
> provided to it on the command line to ensure that the SQL data is distributed
> evenly across the DFS. However I cannot see any mechanism for achieving this
> explicitly other than relying on the implicit distribution provided by
> default by HDFS. Is this correct or are there methods on some API that allow
> me to manage the distribution to ensure that it is balanced across all nodes
> in my cluster?
>
> Thanks,
>
> Andy D
>

--
Harsh J
http://harshj.com
