On Thu, Jun 18, 2009 at 10:55 AM, Alex Loddengaard <a...@cloudera.com> wrote:

> I'm a little confused about what your question is.  Are you asking why HDFS
> has
> consistent read/write speeds even as your cluster gets more and more data?
>
> If so, two HDFS bottlenecks that would change read/write performance as
> used capacity changes are name node (NN) RAM and the amount of data each
> of your data nodes (DNs) is storing.  If you have so much metadata (lots
> of files, blocks, etc.) that the NN java process uses most of your NN's
> memory, then you'll see a big decrease in performance.
>

To avoid this issue, simply watch swap usage on your NN. If your NN starts
swapping, you will likely run into problems with metadata operation speed.
This won't affect the throughput of reads/writes within a block, though.
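
For what it's worth, here's a rough sketch in Python of the kind of check I
mean. The swap test just reads /proc/meminfo, and the heap estimate uses the
oft-quoted ballpark of roughly 150 bytes of NN heap per namespace object
(file, directory, or block); treat that figure as an assumption, not a hard
number.

    # Warn if the NN host is dipping into swap (Linux only).
    def swap_used_kb():
        fields = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                fields[key] = int(rest.strip().split()[0])  # values in kB
        return fields["SwapTotal"] - fields["SwapFree"]

    # Ballpark the NN heap needed for the namespace, assuming ~150 bytes
    # per file, directory, and block object.
    def estimated_nn_heap_bytes(num_files, num_dirs, num_blocks,
                                bytes_per_object=150):
        return (num_files + num_dirs + num_blocks) * bytes_per_object

    if swap_used_kb() > 0:
        print("NN host is swapping; expect slow metadata operations")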


> This bottleneck usually only shows itself on large clusters with tons of
> metadata, though a small cluster with a wimpy NN machine will have the
> same bottleneck.  Similarly, if each of your DNs is storing close to its
> capacity, then reads/writes will begin to slow down, as each node will be
> responsible for streaming more and more data in and out.  Does that make
> sense?
>
> You should fill your cluster up to 80-90% at most.  I imagine you'd see a
> decrease in read/write performance depending on the tests you're running,
> though I can't say I've done this performance test before.  I'm merely
> speculating.
>

Another thing to keep in mind is that local filesystem performance begins to
suffer once a disk is more than 80% or so full. This is due to the way
filesystems endeavour to keep file fragmentation low. When there is little
free space left on the drive, the filesystem has fewer options for relocating
blocks and fighting fragmentation, so "sequential" writes and reads will
actually incur seeks on the local disk. Since the datanodes store their
blocks on the local filesystem, this is a factor worth considering.
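
If it helps, here's a quick sketch of how you might watch for that on a
datanode. The data directory paths are made-up placeholders; substitute
whatever your dfs.data.dir points at.

    import shutil

    # Hypothetical dfs.data.dir mount points; substitute your own.
    DATA_DIRS = ["/data/1/dfs/dn", "/data/2/dfs/dn"]
    THRESHOLD = 0.80  # past ~80% full, fragmentation starts to bite

    for d in DATA_DIRS:
        usage = shutil.disk_usage(d)
        frac = usage.used / float(usage.total)
        if frac > THRESHOLD:
            print("%s is %.0f%% full; expect extra seeks" % (d, frac * 100))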

-Todd
