Colin Patrick McCabe created HDFS-4817:
------------------------------------------

             Summary: make HDFS advisory caching configurable on a per-file basis
                 Key: HDFS-4817
                 URL: https://issues.apache.org/jira/browse/HDFS-4817
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: hdfs-client
    Affects Versions: 3.0.0
            Reporter: Colin Patrick McCabe
            Assignee: Colin Patrick McCabe
            Priority: Minor


HADOOP-7753 and related JIRAs introduced some performance optimizations for the 
DataNode.  One of them was readahead.  When readahead is enabled, the DataNode 
starts reading the next bytes it thinks it will need in the block file before 
the client requests them.  This helps hide the latency of rotational media and 
sends larger reads down to the device.  Another optimization was "drop-behind."  
With this optimization, the DataNode removes file data from the Linux page 
cache once it is no longer needed.
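
To make the two hints concrete, here is a minimal sketch of a sequential reader that issues both of them itself.  It assumes the hints map onto posix_fadvise(2) on Linux (POSIX_FADV_WILLNEED for readahead, POSIX_FADV_DONTNEED for drop-behind); it is not the DataNode code, and the function name, chunk size, and readahead window are made up for illustration.

```python
import os


def read_with_advice(path, chunk_size=64 * 1024,
                     readahead=4 * 1024 * 1024, drop_behind=True):
    """Read a file sequentially while giving the kernel advisory caching hints.

    Illustrative only: the name and default sizes are assumptions, and
    os.posix_fadvise is only available on Unix platforms that provide it.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            # Readahead: ask the kernel to start fetching the upcoming window
            # before we actually request those bytes.
            os.posix_fadvise(fd, offset, readahead, os.POSIX_FADV_WILLNEED)
            data = os.pread(fd, chunk_size, offset)
            if not data:
                break
            total += len(data)
            if drop_behind:
                # Drop-behind: tell the kernel the bytes just consumed will
                # not be needed again, so their cached pages can be released.
                os.posix_fadvise(fd, offset, len(data),
                                 os.POSIX_FADV_DONTNEED)
            offset += len(data)
    finally:
        os.close(fd)
    return total
```

Both calls are advisory: the kernel is free to ignore them, and the data read is the same whether or not they are issued.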

Enabling {{dfs.datanode.drop.cache.behind.writes}} and 
{{dfs.datanode.drop.cache.behind.reads}} can improve performance substantially 
on many MapReduce jobs.  In our internal benchmarks, we have seen speedups of 
40% on certain workloads.  The reason is that if we know the block data will 
not be read again any time soon, keeping it out of the page cache leaves more 
memory for the other processes on the system.  See HADOOP-7714 for more 
benchmarks.

We would like to be able to enable these settings on a per-file or per-client 
basis, rather than only for the DataNode as a whole.  This will allow more 
users to actually make use of them.  It would also be good to add unit tests 
for the drop-cache code path, to ensure that it is functioning as we expect.
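
One possible shape for the per-file behaviour is sketched below: a setting supplied for a particular file or client takes precedence, and anything left unset falls back to the DataNode-wide default.  This is only an illustration of the override rule under that assumption; the class, field, and function names are hypothetical and are not taken from any patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CachingStrategy:
    """Hypothetical container for advisory caching settings; None means unset."""
    drop_behind: Optional[bool] = None
    readahead_bytes: Optional[int] = None


def resolve(datanode_default: CachingStrategy,
            per_file: CachingStrategy) -> CachingStrategy:
    """Use each per-file setting when present, else the DataNode default."""
    def pick(override, default):
        # An explicit False or 0 is a real choice; only None means "unset".
        return default if override is None else override

    return CachingStrategy(
        drop_behind=pick(per_file.drop_behind, datanode_default.drop_behind),
        readahead_bytes=pick(per_file.readahead_bytes,
                             datanode_default.readahead_bytes),
    )


# DataNode-wide defaults (values chosen for illustration only).
datanode_default = CachingStrategy(drop_behind=False,
                                   readahead_bytes=4 * 1024 * 1024)
# A client scanning a large file once asks for drop-behind on that file only.
scan_job = CachingStrategy(drop_behind=True)
effective = resolve(datanode_default, scan_job)
print(effective)
```

Keeping "unset" distinct from "false" lets a client explicitly turn drop-behind off for a file even when the DataNode default has it on.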

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
