Thank you - yes, I'm fairly confident that it will work either way. I'm
trying to find out whether there is an established best practice, and
what the performance impact is of choosing between RAID 0 and JBOD.
I'll check out the noatime and nodiratime mount options for their effect
on our performance - thanks for that suggestion as well.
David
Jason Venner wrote:
> If you specify your dfs data directories as a comma-separated list, you
> will do fine.
>
> <property>
> <name>dfs.data.dir</name>
> <value>${hadoop.tmp.dir}/dfs/data</value>
> <description>Determines where on the local filesystem a DFS data node
> should store its blocks. If this is a comma-delimited
> list of directories, then data will be stored in all named
> directories, typically on different devices.
> Directories that do not exist are ignored.
> </description>
> </property>
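>
> As a rough illustration only, an override for three independent disks
> might look like the following in your site configuration. The mount
> points /disk1, /disk2 and /disk3 are hypothetical placeholders for
> wherever your three data disks are actually mounted.
>
> <property>
> <name>dfs.data.dir</name>
> <!-- hypothetical mount points, one directory per physical disk -->
> <value>/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data</value>
> </property>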
>
> The namenode does a lot of small writes, so RAID 1 or RAID 10 is better
> for it.
>
> Also, mounting the file systems that hold dfs.data.dir with noatime
> and nodiratime makes a significant performance difference.
>
> David B. Ritch wrote:
>> How well does Hadoop handle multiple independent disks per node?
>>
>> I have a cluster with 4 identical disks per node. I plan to use one
>> disk for OS and temporary storage, and dedicate the other three to
>> HDFS. Our IT folks have some disagreement as to whether the three disks
>> should be striped, or treated by HDFS as three independent disks. Could
>> someone with more HDFS experience comment on the relative advantages and
>> disadvantages to each approach?
>>
>> Here are some of my thoughts. It's a bit easier to manage a 3-disk
>> striped partition, and we wouldn't have to worry about balancing files
>> between them. Single-file I/O should be considerably faster. On the
>> other hand, I would expect typical use to require multiple file reads
>> or writes simultaneously. I would expect Hadoop to be able to manage
>> read/write to/from the disks independently. Managing 3 streams to 3
>> independent devices would likely result in less disk head movement, and
>> therefore better performance. I would expect Hadoop to be able to
>> balance load between the disks fairly well. Availability doesn't really
>> differentiate between the two approaches - if a single disk dies, the
>> striped array would go down, but all its data should be replicated on
>> another datanode, anyway. And besides, I understand that a datanode
>> will shut itself down even if only one of its 3 independent disks crashes.
>>
>> So - does anyone want to agree or disagree with these thoughts? Anyone have
>> any other ideas, or - better - benchmarks and experience with layouts
>> like these two?
>>
>> Thanks!
>>
>> David
>>
>