[
https://issues.apache.org/jira/browse/HADOOP-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122616#comment-13122616
]
Scott Carey commented on HADOOP-7714:
-------------------------------------
{quote}I think the issue is that Linux's native readahead is not very
aggressive,
{quote}
I have been tuning my systems for quite a while with aggressive OS readahead.
The default is 128K, but it can be raised significantly, which helps quite a bit
on sequential reads to SATA drives. Additionally, the 'deadline' scheduler is
better at sequential throughput under contention. I wonder how much of your
manual read-ahead is just compensating for the poor OS defaults? In other
applications, I maximized read speeds (and reduced CPU use) by using small read
buffers in Java (32KB) and large Linux read-ahead settings.
Additionally, I always set up a separate file system for M/R temp space away
from HDFS. The HDFS one is tuned for sequential reads and fast flush from OS
buffers to disk, with the deadline scheduler. The temp space is tuned to delay
flush to disk for up to 60 seconds (small jobs don't even make it to disk this
way), and uses the CFQ scheduler.
This combination reduced the time of many of our jobs significantly (CDH2 and
CDH3) -- especially job chains with many small tasks mixed in.
The Linux tuning parameters that have a big effect on disk performance and
pagecache behavior are (a per-file alternative is sketched after this list):
vm.dirty_ratio
vm.dirty_background_ratio
vm.swappiness
readahead (e.g. blockdev --setra 4096 /dev/sda)
ext4 also has inode_readahead_blks=n and commit=nrsec
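
For illustration only (this code is not from the comment above): much of the benefit of a larger global readahead can also be requested per file with posix_fadvise, leaving the global sysctl/blockdev defaults conservative. A minimal C sketch, assuming the small 32KB user buffer described earlier; the buffer size and the hint used are illustrative assumptions:
{code}
/* Illustrative sketch (assumptions, not the commenter's code): a sequential
 * read loop with a small 32KB user buffer, plus a per-file hint that lets the
 * kernel enlarge its readahead window for this file without changing the
 * global blockdev/sysctl readahead settings. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

int read_sequentially(const char *path)
{
    char buf[32 * 1024];                  /* small user-space buffer (32KB) */
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Hint that access is sequential so the kernel may read ahead more
     * aggressively for this descriptor only. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* process n bytes of data here */
    }
    close(fd);
    return (n < 0) ? -1 : 0;
}
{code}
POSIX_FADV_SEQUENTIAL is only a hint; how far the kernel actually reads ahead still depends on the block layer and the device readahead setting.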
> Add support in native libs for OS buffer cache management
> ---------------------------------------------------------
>
> Key: HADOOP-7714
> URL: https://issues.apache.org/jira/browse/HADOOP-7714
> Project: Hadoop Common
> Issue Type: Bug
> Components: native
> Affects Versions: 0.24.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: graphs.pdf, hadoop-7714-2.txt, hadoop-7714-20s-prelim.txt
>
>
> Especially in shared HBase/MR situations, management of the OS buffer cache
> is important. Currently, running a big MR job will evict all of HBase's hot
> data from cache, causing HBase performance to really suffer. However, caching
> of the MR input/output is rarely useful, since the datasets tend to be larger
> than cache and are not re-read often enough for the cache to help. Having access
> to the native calls {{posix_fadvise}} and {{sync_file_range}} on platforms
> where they are supported would allow us to do a better job of managing this
> cache.
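
For reference, a minimal C sketch of how those two calls are typically combined for this purpose; the byte range, flags, and ordering here are illustrative assumptions, not necessarily what the attached patches do:
{code}
/* Illustrative sketch: after writing a chunk of MR output that will not be
 * re-read soon, start writeback early and then drop the pages from the OS
 * cache so they do not evict hot HBase data. */
#define _GNU_SOURCE              /* sync_file_range() is Linux-specific */
#include <fcntl.h>

int flush_and_drop(int fd, off_t offset, off_t len)
{
    /* Begin asynchronous writeback of this byte range. */
    if (sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE) != 0)
        return -1;

    /* Advise the kernel that the cached pages for this range will not be
     * needed again.  Pages that are still dirty are skipped, so in practice
     * the DONTNEED advice is issued once writeback has finished. */
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
{code}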