OK, I understand that the balancer process can be run manually, but the
Sqoop documentation seems to imply that Sqoop does the balancing for you,
based on the split key, as you note.
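
To check that I have the split-key part right, here is a rough sketch of how
I picture it (Python, purely illustrative; `split_ranges` and its parameters
are names I made up, and this is not Sqoop's actual code). I am assuming a
numeric split column whose minimum and maximum values are known, and that the
range is simply divided evenly among the map tasks:

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the closed key range [lo, hi] into contiguous chunks,
    one per map task. Hypothetical helper, for illustration only."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if hi < lo:
        raise ValueError("hi must not be smaller than lo")

    total = hi - lo + 1                      # number of key values to cover
    base, extra = divmod(total, num_mappers)  # even share plus leftovers

    ranges = []
    start = lo
    for i in range(num_mappers):
        # The first `extra` chunks each take one leftover key value.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more map tasks than key values; nothing left to assign
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    # Each tuple is the inclusive key range one map task would import.
    for task, (first, last) in enumerate(split_ranges(1, 100, 4)):
        print(f"map task {task}: keys {first}..{last}")
```

As far as I can tell, a split like this only decides which rows each map
task reads; nothing in it chooses a DataNode.
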
But what causes the various Sqoop import map tasks to write to different
DataNodes? That is, what stops them all from writing to the same node in the
ultimate pathological case?
Thanks,
Andy D
On 17 Mar 2011, at 00:28, Harsh J <[email protected]> wrote:
> There's a balancer available to re-balance DNs across the HDFS cluster
> in general. It is available in the $HADOOP_HOME/bin/ directory as
> start-balancer.sh
>
> But what I think sqoop implies is that your data is balanced due to
> the map jobs it runs for imports (using a provided split factor
> between maps), which should make it write chunks of data out to
> different DataNodes.
>
> I guess you could get more information on the Sqoop mailing list
> [email protected],
> https://groups.google.com/a/cloudera.org/group/sqoop-user/topics
>
> On Thu, Mar 17, 2011 at 5:04 AM, BeThere <[email protected]> wrote:
>> The Sqoop documentation seems to imply that it uses the key information
>> provided to it on the command line to ensure that the SQL data is
>> distributed evenly across the DFS. However, I cannot see any mechanism for
>> achieving this explicitly, other than relying on the implicit distribution
>> that HDFS provides by default. Is this correct, or are there methods in some
>> API that allow me to manage the distribution to ensure that it is balanced
>> across all nodes in my cluster?
>>
>> Thanks,
>>
>> Andy D
>>
>>
>
>
>
> --
> Harsh J
> http://harshj.com