Danil Lipovoy created HADOOP-16901:
--------------------------------------

             Summary: HDFS-client: boost ShortCircuit Cache
                 Key: HADOOP-16901
                 URL: https://issues.apache.org/jira/browse/HADOOP-16901
             Project: Hadoop Common
          Issue Type: New Feature
         Environment: 4 nodes E5-2698 v4 @ 2.20GHz, 700 Gb Mem.
8 RegionServers (2 per host)
8 tables by 64 regions by 1.88 Gb data in each = 1200 Gb total
Random reads in 800 threads via YCSB, plus a small share of updates (10% of reads)
            Reporter: Danil Lipovoy
         Attachments: hdfs_cpu.png, hdfs_reads.png


I want to propose a way to improve the read performance of the HDFS client. The idea: create several ShortCircuit cache instances instead of one.

The key points:

1. Create an array of caches:
{code:java}
private ClientContext(String name, DfsClientConf conf, Configuration config) {
  ...
  shortCircuitCache = new ShortCircuitCache[this.clientShortCircuitNum];
  for (int i = 0; i < this.clientShortCircuitNum; i++) {
    this.shortCircuitCache[i] = ShortCircuitCache.fromConf(scConf);
  }
{code}

2. Then divide the blocks among the caches:
{code:java}
public ShortCircuitCache getShortCircuitCache(long idx) {
  return shortCircuitCache[(int) (idx % clientShortCircuitNum)];
}
{code}

3. And this is how to call it:
{code:java}
ShortCircuitCache cache = clientContext.getShortCircuitCache(block.getBlockId());
{code}

The last digit of the block ID is evenly distributed from 0 to 9, so all caches fill at approximately the same rate. That is good for performance.

The attachments show a load test reading HDFS via HBase with clientShortCircuitNum = 3. We can see that performance grows by ~30% and CPU usage by about 15%.

I will try to add a link to the pull request soon.
I hope this is interesting to somebody. I am ready to explain anything that is not obvious.
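The distribution argument can be checked with a small standalone sketch. It assumes block IDs are handed out as consecutive numbers, which the message above does not state. The class `ShardIndexSketch` and its methods `slotFor` and `histogram` are hypothetical names used only for illustration; they are not part of the HDFS client. The sketch also swaps the `%` from the snippet for `Math.floorMod`, because in Java `%` returns a negative remainder for a negative operand, which would produce an invalid array index if a negative ID were ever passed.

```java
import java.util.Arrays;

// Illustration only: hypothetical helper, not HDFS client code.
final class ShardIndexSketch {

    // Maps a block ID to a cache slot in [0, numCaches).
    // Math.floorMod stays non-negative for negative IDs,
    // whereas (int) (blockId % numCaches) would not.
    static int slotFor(long blockId, int numCaches) {
        return (int) Math.floorMod(blockId, (long) numCaches);
    }

    // Counts how many of `count` consecutive IDs, starting at firstId,
    // fall into each slot.
    static int[] histogram(long firstId, int count, int numCaches) {
        int[] perSlot = new int[numCaches];
        for (long id = firstId; id < firstId + count; id++) {
            perSlot[slotFor(id, numCaches)]++;
        }
        return perSlot;
    }

    public static void main(String[] args) {
        // Assumed workload: 9000 consecutive IDs spread over 3 caches.
        System.out.println(Arrays.toString(histogram(1_073_741_825L, 9_000, 3)));
        // Negative ID: floorMod gives 2, while -7 % 3 would give -1.
        System.out.println(slotFor(-7L, 3));
    }
}
```

Under the consecutive-ID assumption every slot receives the same number of blocks, which matches the claim that the caches fill evenly. Real block ID sequences may be less regular, so this is a sanity check of the indexing scheme rather than a measurement.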