[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-05-18 Thread Mikhail Pryakhin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110419#comment-17110419
 ] 

Mikhail Pryakhin edited comment on HDFS-15202 at 5/18/20, 7:36 PM:
---

[~weichiu] it seems that this patch breaks test compilation in trunk


{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile 
(default-testCompile) on project hadoop-hdfs: Compilation failure: Compilation 
failure:
[ERROR] 
/home/vagrant/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java:[244,48]
 error: ')' expected
[ERROR] 
/home/vagrant/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java:[244,53]
 error: illegal start of expression
[ERROR] 
/home/vagrant/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java:[244,54]
 error: ';' expected
[ERROR] 
/home/vagrant/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java:[245,14]
 error: not a statement
[ERROR] 
/home/vagrant/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java:[245,22]
 error: ';' expected
[ERROR] 
/home/vagrant/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java:[245,39]
 error: <identifier> expected
[ERROR] 
/home/vagrant/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java:[245,41]
 error: illegal start of expression
{code}





> HDFS-client: boost ShortCircuit Cache
> -
>
> Key: HDFS-15202
> URL: https://issues.apache.org/jira/browse/HDFS-15202
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
> Environment: 4 nodes E5-2698 v4 @ 2.20GHz, 700 Gb Mem.
> 8 RegionServers (2 by host)
> 8 tables by 64 regions by 1.88 Gb data in each = 900 Gb total
> Random read in 800 threads via YCSB and a little bit updates (10% of reads)
>Reporter: Danil Lipovoy
>Assignee: Danil Lipovoy
>Priority: Minor
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15202-Addendum-01.patch, HDFS_CPU_full_cycle.png, 
> cpu_SSC.png, cpu_SSC2.png, hdfs_cpu.png, hdfs_reads.png, hdfs_scc_3_test.png, 
> hdfs_scc_test_full-cycle.png, locks.png, requests_SSC.png
>
>
> I want to propose a way to improve the read performance of the HDFS client. 
> The idea: create a few instances of the ShortCircuit cache instead of one. 
> The key points:
> 1. Create an array of caches (sized by 
> clientShortCircuitNum = *dfs.client.short.circuit.num*, see the pull 
> requests below):
> {code:java}
> private ClientContext(String name, DfsClientConf conf, Configuration config) {
> ...
> shortCircuitCache = new ShortCircuitCache[this.clientShortCircuitNum];
> for (int i = 0; i < this.clientShortCircuitNum; i++) {
>   this.shortCircuitCache[i] = ShortCircuitCache.fromConf(scConf);
> }
> {code}
> 2. Then divide the blocks among the caches:
> {code:java}
>   public ShortCircuitCache getShortCircuitCache(long idx) {
> return shortCircuitCache[(int) (idx % clientShortCircuitNum)];
>   }
> {code}
> 3. And here is how it is called:
> {code:java}
> ShortCircuitCache cache = 
> clientContext.getShortCircuitCache(block.getBlockId());
> {code}
> The last digit of the blockId is evenly distributed from 0 to 9, which is why 
> all the caches fill up approximately evenly.
> This is good for performance. The attached load test reads HDFS via HBase 
> with clientShortCircuitNum = 1 vs 3: performance grows by ~30% at about +15% 
> CPU usage.
> Hope this is interesting for someone.
> Ready to explain any unobvious details.
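The selection rule quoted in the description can be exercised on its own with a small sketch (hypothetical standalone demo, not code from the patch; a plain method stands in for ClientContext.getShortCircuitCache):

```java
// Standalone sketch of the cache-sharding rule described above.
// NUM_CACHES plays the role of clientShortCircuitNum
// (dfs.client.short.circuit.num); all names here are hypothetical.
public class CacheShardingDemo {
    static final int NUM_CACHES = 3;

    // Same selection rule as getShortCircuitCache(long idx)
    static int cacheIndex(long blockId) {
        return (int) (blockId % NUM_CACHES);
    }

    public static void main(String[] args) {
        // A few block IDs in the style of those discussed in this thread
        long[] blockIds = {1110256403L, 1110251835L, 1110256236L};
        for (long id : blockIds) {
            System.out.println(id + " -> shortCircuitCache[" + cacheIndex(id) + "]");
        }
    }
}
```

The point of the split is to spread lock contention across several caches instead of concentrating it on one.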

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-15 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059638#comment-17059638
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/15/20, 12:39 PM:
-

Yes, you are right, there are ways to improve performance even more. On the 
other hand, it would be more difficult. Can we go step by step, something like 
an agile approach? Anyway, since it is better than before, can we merge it? It 
is important for me. I would test some CRC32 distribution, for example, later.









[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-15 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059624#comment-17059624
 ] 

Lisheng Sun edited comment on HDFS-15202 at 3/15/20, 11:43 AM:
---

Two situations:
1. When clientShortCircuitNum is 10, the choice of ShortCircuitCache for a 
given blockId is not very uniform.
2. For example, if clientShortCircuitNum is 3 and many of the SSR blockIds end 
in ***1, ***4, ***7, they will all fall into the same ShortCircuitCache.

Since blockIds in a real environment are completely unpredictable, I suggest 
designing a strategy for allocating a block to a specific ShortCircuitCache. 
This should improve performance even more.
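Whether blockId % clientShortCircuitNum spreads blocks evenly is easy to probe empirically with a standalone sketch (hypothetical demo code, not part of the patch): bucket a large number of pseudo-random IDs and compare the counts.

```java
import java.util.Random;

// Empirical check of how evenly blockId % numCaches distributes
// pseudo-random IDs across cache indexes. Demo code only.
public class BucketUniformityCheck {
    static long[] bucketCounts(int numCaches, int samples, long seed) {
        long[] counts = new long[numCaches];
        Random rnd = new Random(seed);
        for (int i = 0; i < samples; i++) {
            // floorMod keeps the index non-negative even for negative IDs
            counts[(int) Math.floorMod(rnd.nextLong(), (long) numCaches)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] counts = bucketCounts(3, 1_000_000, 42L);
        for (int b = 0; b < counts.length; b++) {
            System.out.println("cache[" + b + "]: " + counts[b]);
        }
    }
}
```

For uniformly random IDs each bucket lands within a fraction of a percent of samples / numCaches; the open question in this thread is whether real blockIds behave randomly enough.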








[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-12 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058076#comment-17058076
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/12/20, 4:57 PM:


[~leosun08]

Thanks for the interest!
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {
...
LOG.info("SSC blockId: " + block.getBlockId());
ShortCircuitCache cache = 
clientContext.getShortCircuitCache(block.getBlockId());
{code}

Then I ran a read test via HBase. Here is the log output:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.log.out |grep "SSC blockId"
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256526
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252104
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256969
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256965
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256382
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251751
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256871
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256825
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256241
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256548
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256808
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252097
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110257035
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256779
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256331
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251521
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,452 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,455 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769

We can see that the last digit is evenly distributed. For example:
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403 -> 3
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835 -> 5
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236 -> 6

Then I collected some statistics:
{code:java}
cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.out |grep "SSC blockId"|awk 
'{print substr($0,length,1)}' >> ~/ids.txt
{code}

{code:java}
cat ~/ids.txt   | sort | uniq -c | sort -nr | awk '{printf "%-8s%s\n", $2, 
$1}'|sort
{code}

{code}
0   645813
1   559617
2   532624
3   551147
4   484945
5   465928
6   570635
7   473285
8   525565
9   447981
{code}

This means the last digit 0 appears 645813 times in the logs,
digit 1 appears 559617 times,
and so on.
Quite even.
When we take the blockId modulo clientShortCircuitNum, the blocks are cached 
evenly too. For example, if clientShortCircuitNum = 3, the last digits map to:
blockId *0 % 3 -> shortCircuitCache[0]
blockId *1 % 3 -> shortCircuitCache[1]
blockId *2 % 3 -> shortCircuitCache[2]
blockId *3 % 3 -> shortCircuitCache[0]
blockId *4 % 3 -> shortCircuitCache[1]
blockId *5 % 3 -> shortCircuitCache[2]
blockId *6 % 3 -> shortCircuitCache[0]
blockId *7 % 3 -> shortCircuitCache[1]
blockId *8 % 3 -> shortCircuitCache[2]
blockId *9 % 3 -> shortCircuitCache[0]

There are slightly more hits on [0] than on [1] or [2] (4 digits vs 3 and 3), 
but not by much, and it works well. When clientShortCircuitNum = 2 or 5 the 
distribution is perfectly even.
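The "4 vs 3 and 3" split can be reproduced with a tiny sketch (hypothetical standalone code; it assumes, as the mapping above does, that the bucket is counted per last digit):

```java
import java.util.Arrays;

// Count how many of the ten possible last digits map to each cache
// index under lastDigit % numCaches. Demo code only.
public class DigitBucketCount {
    static int[] digitsPerCache(int numCaches) {
        int[] buckets = new int[numCaches];
        for (int lastDigit = 0; lastDigit <= 9; lastDigit++) {
            buckets[lastDigit % numCaches]++;
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(digitsPerCache(3))); // [4, 3, 3]
        System.out.println(Arrays.toString(digitsPerCache(5))); // [2, 2, 2, 2, 2]
    }
}
```

Values of numCaches that divide 10 evenly (2 and 5) give a perfectly balanced split, which matches the observation above.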





[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-12 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058076#comment-17058076
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/12/20, 4:46 PM:


[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {
...
LOG.info("SSC blockId: " + block.getBlockId());
ShortCircuitCache cache = 
clientContext.getShortCircuitCache(block.getBlockId());
{code}

And run reading test via HBase. We can see the log output:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.log.out |grep "SSC blockId"
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256526
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252104
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256969
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256965
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256382
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251751
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256871
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256825
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256241
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256548
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256808
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252097
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110257035
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256779
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256331
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251521
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,452 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,455 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769

We can see that the last digit is evenly distributed. For example:
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403 -> 3
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835 -> 5
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236 -> 6

Then I collected some statistics:
{code:java}
cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.out |grep "SSC blockId"|awk 
'{print substr($0,length,1)}' >> ~/ids.txt
{code}

{code:java}
cat ~/ids.txt   | sort | uniq -c | sort -nr | awk '{printf "%-8s%s\n", $2, 
$1}'|sort
{code}
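The tally the awk pipeline produces can also be sketched in Java — a hedged 
illustration, assuming the block IDs are available as longs (the class and 
method names here are invented for the example, not from the patch):

```java
// Count how often each last decimal digit occurs among block IDs,
// mirroring what the grep/awk/sort/uniq pipeline above extracts from the logs.
public class LastDigitHistogram {
    static int[] histogram(long[] blockIds) {
        int[] counts = new int[10];
        for (long id : blockIds) {
            counts[(int) (id % 10)]++; // last decimal digit of the block ID
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] ids = {1110256403L, 1110251835L, 1110256236L, 1110251488L};
        int[] h = histogram(ids);
        for (int d = 0; d < 10; d++) {
            System.out.println(d + "\t" + h[d]);
        }
    }
}
```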

0   645813
1   559617
2   532624
3   551147
4   484945
5   465928
6   570635
7   473285
8   525565
9   447981

That is, in the logs the last digit 0 occurs 645,813 times, 
digit 1 occurs 559,617 times, 
and so on. 
Quite evenly distributed. 
When we take blockId modulo clientShortCircuitNum, the blocks are cached evenly 
too. For example, if clientShortCircuitNum = 3, block IDs grouped by last digit 
go to:
blockId ...0 % 3 -> shortCircuitCache[0]
blockId ...1 % 3 -> shortCircuitCache[1]
blockId ...2 % 3 -> shortCircuitCache[2]
blockId ...3 % 3 -> shortCircuitCache[0]
blockId ...4 % 3 -> shortCircuitCache[1]
blockId ...5 % 3 -> shortCircuitCache[2]
blockId ...6 % 3 -> shortCircuitCache[0]
blockId ...7 % 3 -> shortCircuitCache[1]
blockId ...8 % 3 -> shortCircuitCache[2]
blockId ...9 % 3 -> shortCircuitCache[0]

There are slightly more IDs mapping to [0] than to [1] or [2], but not by much, 
and it works well. When clientShortCircuitNum = 2 or 5 the distribution is 
perfect)



was (Author: pustota):
[~leosun08]

Thanks for interest!)
Let me 

[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-12 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058076#comment-17058076
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/12/20, 4:45 PM:


[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {
...
LOG.info("SSC blockId: " + block.getBlockId());
ShortCircuitCache cache = 
clientContext.getShortCircuitCache(block.getBlockId());
{code}

And run reading test via HBase. We can see the log output:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.log.out |grep "SSC blockId"
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256526
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252104
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256969
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256965
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256382
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251751
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256871
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256825
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256241
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256548
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256808
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252097
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110257035
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256779
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256331
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251521
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,452 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,455 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769

We can see that last digit is evenly distributed. There are:
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403 -> 3
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835 -> 5
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236 -> 6

Then I collected some statistics:
{code:java}
cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.out |grep "SSC blockId"|awk 
'{print substr($0,length,1)}' >> ~/ids.txt
{code}

{code:java}
cat ~/ids.txt   | sort | uniq -c | sort -nr | awk '{printf "%-8s%s\n", $2, 
$1}'|sort
{code}

0   645813
1   559617
2   532624
3   551147
4   484945
5   465928
6   570635
7   473285
8   525565
9   447981

It means in the logs the last digit 0 found 645813 times 
Digit 1 - 559617 times
etc. 
Quite evenly. 
When we divide blockId by modulus the blocks will cached evenly too. For 
example if clientShortCircuitNum = 3, then last digits will go to:
blockId *0 % 3 -> shortCircuitCache[0]
blockId *1 % 3 -> shortCircuitCache[1]
blockId *2 % 3 -> shortCircuitCache[2]
blockId *3 % 3 -> shortCircuitCache[0]
blockId *4 % 3 -> shortCircuitCache[1]
blockId *5 % 3 -> shortCircuitCache[2]
blockId *6 % 3 -> shortCircuitCache[0]
blockId *7 % 3 -> shortCircuitCache[1]
blockId *8 % 3 -> shortCircuitCache[2]
blockId *9 % 3 -> shortCircuitCache[0]

There a little bit more [0] then [1] or [2], but not too much and works good)



was (Author: pustota):
[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  

[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-12 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058076#comment-17058076
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/12/20, 4:42 PM:


[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {
...
LOG.info("SSC blockId: " + block.getBlockId());
ShortCircuitCache cache = 
clientContext.getShortCircuitCache(block.getBlockId());
{code}

And run reading test via HBase. We can see the log output:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.log.out |grep "SSC blockId"
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256526
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252104
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256969
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256965
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256382
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251751
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256871
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256825
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256241
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256548
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256808
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252097
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110257035
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256779
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256331
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251521
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,452 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,455 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769

We can see that last digit is evenly distributed. There are:
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403 -> 3
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835 -> 5
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236 -> 6

Then I collected some statistics:
{code:java}
cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.out |grep "SSC blockId"|awk 
'{print substr($0,length,1)}' >> ~/ids.txt
{code}

{code:java}
cat ~/ids.txt   | sort | uniq -c | sort -nr | awk '{printf "%-8s%s\n", $2, 
$1}'|sort
{code}

0   645813
1   559617
2   532624
3   551147
4   484945
5   465928
6   570635
7   473285
8   525565
9   447981

It means in the logs the last digit 0 found 645813 times 
Digit 1 - 559617 times
etc. 
Quite evenly. 
When we divide blockId by modulus the blocks will cached evenly too. For 
example if clientShortCircuitNum = 3, then last digits will go to:
blockId *0 -> shortCircuitCaches[0]
blockId *1 -> shortCircuitCaches[1]
blockId *2 -> shortCircuitCaches[2]
blockId *3 -> shortCircuitCaches[0]
blockId *4 -> shortCircuitCaches[1]
blockId *5 -> shortCircuitCaches[2]
blockId *6 -> shortCircuitCaches[0]
blockId *7 -> shortCircuitCaches[1]
blockId *8 -> shortCircuitCaches[2]
blockId *9 -> shortCircuitCaches[0]

There a little bit more [0] then [1] or [2], but not too much and works good)



was (Author: pustota):
[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader 

[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-12 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058076#comment-17058076
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/12/20, 4:41 PM:


[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {
...
LOG.info("SSC blockId: " + block.getBlockId());
ShortCircuitCache cache = 
clientContext.getShortCircuitCache(block.getBlockId());
{code}

And run reading test via HBase. We can see the log output:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.log.out |grep "SSC blockId"
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256526
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252104
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256969
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256965
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256382
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251751
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256871
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256825
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256241
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256548
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256808
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252097
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110257035
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256779
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256331
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251521
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,452 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,455 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769

We can see that last digit is evenly distributed. There are:
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403 -> 3
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835 -> 5
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236 -> 6

Then I collected some statistics:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.out |grep "SSC blockId"|awk 
\'{print substr($0,length,1)}\' >> ~/ids.txt
cat ~/ids.txt   | sort | uniq -c | sort -nr | awk \'{printf "%-8s%s\n", $2, 
$1}\'|sort
0   645813
1   559617
2   532624
3   551147
4   484945
5   465928
6   570635
7   473285
8   525565
9   447981

It means in the logs the last digit 0 found 645813 times 
Digit 1 - 559617 times
etc. 
Quite evenly. 
When we divide blockId by modulus the blocks will cached evenly too. For 
example if clientShortCircuitNum = 3, then last digits will go to:
blockId *0 -> shortCircuitCaches[0]
blockId *1 -> shortCircuitCaches[1]
blockId *2 -> shortCircuitCaches[2]
blockId *3 -> shortCircuitCaches[0]
blockId *4 -> shortCircuitCaches[1]
blockId *5 -> shortCircuitCaches[2]
blockId *6 -> shortCircuitCaches[0]
blockId *7 -> shortCircuitCaches[1]
blockId *8 -> shortCircuitCaches[2]
blockId *9 -> shortCircuitCaches[0]

There a little bit more [0] then [1] or [2], but not too much and works good)



was (Author: pustota):
[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {

[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-12 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058076#comment-17058076
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/12/20, 4:40 PM:


[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {
...
LOG.info("SSC blockId: " + block.getBlockId());
ShortCircuitCache cache = 
clientContext.getShortCircuitCache(block.getBlockId());
{code}

And run reading test via HBase. We can see the log output:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.log.out |grep "SSC blockId"
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256526
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252104
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256969
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256965
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256382
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251751
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256871
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256825
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256241
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256548
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256808
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252097
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110257035
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256779
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256331
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251521
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,452 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,455 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769

We can see that last digit is evenly distributed. There are:
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403 -> 3
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835 -> 5
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236 -> 6

Then I collected some statistics:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.out |grep "SSC blockId"|awk 
''{print substr($0,length,1)}'' >> ~/ids.txt
cat ~/ids.txt   | sort | uniq -c | sort -nr | awk ''{printf "%-8s%s\n", $2, 
$1}''|sort
0   645813
1   559617
2   532624
3   551147
4   484945
5   465928
6   570635
7   473285
8   525565
9   447981

It means in the logs the last digit 0 found 645813 times 
Digit 1 - 559617 times
etc. 
Quite evenly. 
When we divide blockId by modulus the blocks will cached evenly too. For 
example if clientShortCircuitNum = 3, then last digits will go to:
blockId *0 -> shortCircuitCaches[0]
blockId *1 -> shortCircuitCaches[1]
blockId *2 -> shortCircuitCaches[2]
blockId *3 -> shortCircuitCaches[0]
blockId *4 -> shortCircuitCaches[1]
blockId *5 -> shortCircuitCaches[2]
blockId *6 -> shortCircuitCaches[0]
blockId *7 -> shortCircuitCaches[1]
blockId *8 -> shortCircuitCaches[2]
blockId *9 -> shortCircuitCaches[0]

There a little bit more [0] then [1] or [2], but not too much and works good)



was (Author: pustota):
[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {

[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-12 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058076#comment-17058076
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/12/20, 4:40 PM:


[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {
...
LOG.info("SSC blockId: " + block.getBlockId());
ShortCircuitCache cache = 
clientContext.getShortCircuitCache(block.getBlockId());
{code}

And run reading test via HBase. We can see the log output:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.log.out |grep "SSC blockId"
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256526
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252104
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256969
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256965
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256382
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251751
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256871
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256825
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256241
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256548
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256808
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252097
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110257035
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256779
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256331
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251521
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,452 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,455 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769

We can see that last digit is evenly distributed. There are:
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403 -> 3
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835 -> 5
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236 -> 6

Then I collected some statistics:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.out |grep "SSC blockId"|awk 
'{print substr($0,length,1)}' >> ~/ids.txt
cat ~/ids.txt   | sort | uniq -c | sort -nr | awk '{printf "%-8s%s\n", $2, 
$1}'|sort
0   645813
1   559617
2   532624
3   551147
4   484945
5   465928
6   570635
7   473285
8   525565
9   447981

It means in the logs the last digit 0 found 645813 times 
Digit 1 - 559617 times
etc. 
Quite evenly. 
When we divide blockId by modulus the blocks will cached evenly too. For 
example if clientShortCircuitNum = 3, then last digits will go to:
blockId *0 -> shortCircuitCaches[0]
blockId *1 -> shortCircuitCaches[1]
blockId *2 -> shortCircuitCaches[2]
blockId *3 -> shortCircuitCaches[0]
blockId *4 -> shortCircuitCaches[1]
blockId *5 -> shortCircuitCaches[2]
blockId *6 -> shortCircuitCaches[0]
blockId *7 -> shortCircuitCaches[1]
blockId *8 -> shortCircuitCaches[2]
blockId *9 -> shortCircuitCaches[0]

There a little bit more [0] then [1] or [2], but not too much and works good)



was (Author: pustota):
[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {
...
   

[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-12 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058076#comment-17058076
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/12/20, 4:38 PM:


[~leosun08]

Thanks for interest!)
Let me provide more details. 
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {
...
LOG.info("SSC blockId: " + block.getBlockId());
ShortCircuitCache cache = 
clientContext.getShortCircuitCache(block.getBlockId());
{code}

And run reading test via HBase. We can see the log output:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.log.out |grep "SSC blockId"
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256526
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252104
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256969
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256965
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256382
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251751
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256871
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256825
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256241
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256548
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256808
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252097
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110257035
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256779
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256331
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251521
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,452 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,455 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769

We can see that last digit is evenly distributed. There are:
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403 -> 3
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835 -> 5
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236 -> 6

Then I collected some statistics:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.out |grep "SSC blockId"|awk 
'{print substr($0,length,1)}' >> ~/ids.txt
cat ~/ids.txt   | sort | uniq -c | sort -nr | awk '{printf "%-8s%s\n", $2, 
$1}'|sort
0   645813
1   559617
2   532624
3   551147
4   484945
5   465928
6   570635
7   473285
8   525565
9   447981

This means the last digit 0 appears in the logs 645813 times,
digit 1 appears 559617 times,
and so on: quite even.
When we take blockId modulo clientShortCircuitNum, the blocks are spread across
the caches quite evenly as well, since the ids are roughly sequential. For
example, if clientShortCircuitNum = 3:
blockId % 3 == 0 -> shortCircuitCaches[0]
blockId % 3 == 1 -> shortCircuitCaches[1]
blockId % 3 == 2 -> shortCircuitCaches[2]

One cache may receive slightly more blocks than the others, but not by much, and it works well.




[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-12 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058076#comment-17058076
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/12/20, 4:38 PM:


[~leosun08]

Thanks for your interest!
Let me provide more details.
I added some logging:

{code:java}
  private BlockReader getBlockReaderLocal() throws InvalidToken {
    ...
    LOG.info("SSC blockId: " + block.getBlockId());
    ShortCircuitCache cache =
        clientContext.getShortCircuitCache(block.getBlockId());
{code}

Then I ran a read test via HBase. The log output shows:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.log.out |grep "SSC blockId"
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256526
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252104
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256969
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256965
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256382
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251751
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256871
2020-03-12 18:47:32,447 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256825
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256241
2020-03-12 18:47:32,448 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256548
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251488
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256808
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110252097
2020-03-12 18:47:32,449 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110257035
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256779
2020-03-12 18:47:32,450 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256331
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251521
2020-03-12 18:47:32,451 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835
2020-03-12 18:47:32,452 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251691
2020-03-12 18:47:32,455 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251769

We can see that the last digit is evenly distributed. For example:
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256403 -> 3
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110251835 -> 5
2020-03-12 18:47:32,446 INFO org.apache.hadoop.hdfs.BlockReaderFactory: SSC 
blockId: 1110256236 -> 6

Then I collected some statistics:

cat hbase-cmf-hbase-REGIONSERVER-wx1122-02.ru.out.2 |grep "SSC blockId"|awk 
'{print substr($0,length,1)}' >> ~/ids.txt
cat ~/ids.txt   | sort | uniq -c | sort -nr | awk '{printf "%-8s%s\n", $2, 
$1}'|sort
0   645813
1   559617
2   532624
3   551147
4   484945
5   465928
6   570635
7   473285
8   525565
9   447981

This means the last digit 0 appears in the logs 645813 times,
digit 1 appears 559617 times,
and so on: quite even.
When we take blockId modulo clientShortCircuitNum, the blocks are spread across
the caches quite evenly as well, since the ids are roughly sequential. For
example, if clientShortCircuitNum = 3:
blockId % 3 == 0 -> shortCircuitCaches[0]
blockId % 3 == 1 -> shortCircuitCaches[1]
blockId % 3 == 2 -> shortCircuitCaches[2]

One cache may receive slightly more blocks than the others, but not by much, and it works well.
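As a sanity check, the modulo selection can be reproduced with a small self-contained Java sketch. The block ids below are samples from the log output above; `NUM_CACHES` is a stand-in for the clientShortCircuitNum setting, and `bucketFor` mirrors the getShortCircuitCache selection rule:

```java
import java.util.Arrays;

public class SccBucketDemo {
    // Stand-in for clientShortCircuitNum (dfs.client.short.circuit.num).
    static final int NUM_CACHES = 3;

    // Mirrors getShortCircuitCache: pick a cache by blockId modulo the cache count.
    static int bucketFor(long blockId) {
        return (int) (blockId % NUM_CACHES);
    }

    public static void main(String[] args) {
        // Sample block ids taken from the log output above.
        long[] sampleBlockIds = {
            1110256403L, 1110251835L, 1110256236L, 1110251488L, 1110256526L
        };
        int[] counts = new int[NUM_CACHES];
        for (long id : sampleBlockIds) {
            counts[bucketFor(id)]++;
        }
        // Tally of blocks per cache; roughly even for sequential ids.
        System.out.println(Arrays.toString(counts));
    }
}
```

For these five sample ids the tally is [2, 1, 2], i.e. already close to even with just a handful of blocks.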




[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-12 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058076#comment-17058076
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/12/20, 4:37 PM:



[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-09 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054897#comment-17054897
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/9/20, 1:32 PM:
---

I ran more tests. The earlier tests went through HBase; it is cleaner to just read directly:


{code:java}
conf.set("dfs.client.read.shortcircuit", "true");
// 1 Mb by default, which is bad for performance; maybe the default should be decreased too?
conf.set("dfs.client.read.shortcircuit.buffer.size", "65536");
conf.set("dfs.client.short.circuit.num", num); // from 1 to 10
...
FSDataInputStream in = fileSystem.open(path);
for (int i = 0; i < count; i++) {
  position += 65536;
  if (position > 9)
    position = 0L;
  int res = in.read(position, byteBuffer, 0, 65536);
}
{code}

This code runs in separate threads while we increase the number of files being
read simultaneously (from 10 to 200, the horizontal axis) and the number of
caches (from 1 to 10, the lines). The vertical axis shows the speedup that
adding SCC instances gives relative to the single-cache case.

 !hdfs_scc_test_full-cycle.png! 

 !HDFS_CPU_full_cycle.png! 
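The threading harness described above can be sketched as follows. This is a hedged, self-contained stand-in: the actual HDFS FSDataInputStream.read call is replaced by a byte counter so it runs without a cluster, and the thread and iteration counts are illustrative only:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelReadHarness {
    // N concurrent "readers", each issuing countPerThread sequential 64 KB reads.
    // The placeholder addAndGet stands in for in.read(position, buf, 0, 65536).
    public static long run(int threads, int countPerThread) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong bytesRead = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < countPerThread; i++) {
                    bytesRead.addAndGet(65536); // placeholder for the HDFS read
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return bytesRead.get();
    }

    public static void main(String[] args) throws Exception {
        // 10 readers x 100 reads x 64 KB each.
        System.out.println(run(10, 100));
    }
}
```

With 10 threads and 100 reads each, the harness accounts for 10 * 100 * 65536 = 65,536,000 bytes; in the real test each thread would hold its own FSDataInputStream.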







> HDFS-client: boost ShortCircuit Cache
> -
>
> Key: HDFS-15202
> URL: https://issues.apache.org/jira/browse/HDFS-15202
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
> Environment: 4 nodes E5-2698 v4 @ 2.20GHz, 700 Gb Mem.
> 8 RegionServers (2 by host)
> 8 tables by 64 regions by 1.88 Gb data in each = 900 Gb total
> Random read in 800 threads via YCSB and a little bit updates (10% of reads)
>Reporter: Danil Lipovoy
>Assignee: Danil Lipovoy
>Priority: Minor
> Attachments: HDFS_CPU_full_cycle.png, cpu_SSC.png, cpu_SSC2.png, 
> hdfs_cpu.png, hdfs_reads.png, hdfs_scc_3_test.png, 
> hdfs_scc_test_full-cycle.png, locks.png, requests_SSC.png
>
>
> I want to propose how to improve the read performance of the HDFS client. The 
> idea: create several ShortCircuitCache instances instead of one. 
> The key points:
> 1. Create an array of caches (sized by 
> clientShortCircuitNum=*dfs.client.short.circuit.num*, see the pull 
> requests below):
> {code:java}
> private ClientContext(String name, DfsClientConf conf, Configuration config) {
> ...
> shortCircuitCache = new ShortCircuitCache[this.clientShortCircuitNum];
> for (int i = 0; i < this.clientShortCircuitNum; i++) {
>   this.shortCircuitCache[i] = ShortCircuitCache.fromConf(scConf);
> }
> {code}
> 2. Then divide blocks between the caches:
> {code:java}
>   public ShortCircuitCache getShortCircuitCache(long idx) {
> return shortCircuitCache[(int) (idx % clientShortCircuitNum)];
>   }
> {code}
> 3. And how to call it:
> {code:java}
> ShortCircuitCache cache = 
> clientContext.getShortCircuitCache(block.getBlockId());
> {code}
> The last digit of the block id is evenly distributed from 0 to 9, so all 
> caches fill approximately evenly.
> This is good for performance. The attachments below show a load test reading 
> HDFS via HBase with clientShortCircuitNum = 1 vs 3: performance grows ~30% 
> while CPU usage rises about +15%. 
> Hope this is interesting for someone.
> Ready to explain any unobvious points.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-09 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054902#comment-17054902
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/9/20, 12:59 PM:


How to read the picture: 100 thousand reads of 64 KB blocks take 78 seconds 
with one cache (SCC=1); with SCC=5 caches the same work finishes in 17 seconds, 
a speedup of about 4.6x (78/17).

The best results relative to CPU utilization come with SCC=2 or SCC=3:
SCC=3 improves performance ~3.3x on average while CPU utilization grows 
only ~2.8x.
 !hdfs_scc_3_test.png! 









[jira] [Comment Edited] (HDFS-15202) HDFS-client: boost ShortCircuit Cache

2020-03-07 Thread Danil Lipovoy (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054021#comment-17054021
 ] 

Danil Lipovoy edited comment on HDFS-15202 at 3/7/20, 12:33 PM:


I ran more tests, moving the load generators to other hosts 
and adding more threads (100 at a time) every 5 minutes. 
This helps to see where the performance limits are.
Now CPU usage grows more in line with performance (30% vs 25%).
If read speed matters more to someone than CPU usage, they can enable it.
!requests_SSC.png! 


was (Author: pustota):
I ran more tests, moving the load generators to other hosts 
and adding more threads (100 at a time) every 5 minutes. 
This helps to see where the performance limits are.
Now CPU usage grows with performance, about +30%.
!cpu_SSC.png! 
!requests_SSC.png! 



