[ https://issues.apache.org/jira/browse/HDDS-10465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammi Chen updated HDDS-10465:
------------------------------
    Description: 
 

When using TestDFSIO to compare the random read performance of HDFS and Ozone, 
Ozone is much slower than HDFS. Here is the data from tests in a YCloud cluster.

Test Suite: TestDFSIO

Number of files: 64

File Size: 1024MB

 
||Random read(execution time)||Round1(s)||Round2(s)||
|HDFS| 47.06|49.5|
|Ozone|147.31|149.47|

And for Ozone itself, sequential read is much faster than random read:
||Ozone||Round1(s)||Round2(s)||Round3(s)||
|read execution time|66.62|58.78|68.98|
|random read execution time|147.31|149.47|147.09|

While for HDFS, there is not much of a gap between its sequential read and 
random read execution times:
||HDFS||Round1(s)||Round2(s)||
|read execution time|51.53|44.88|
|random read execution time|47.06|49.5|

After some investigation, we found that the total bytes read from the DNs in 
the TestDFSIO random read test is almost double the data size. Here the total 
data to read is 64 * 1024MB = 64GB, while the aggregated DN bytesReadChunk 
metric increases by 128GB after one test run. The root cause is that when the 
client reads data, it aligns the requested range with 
"ozone.client.bytes.per.checksum", which is currently 1MB. For example, to 
read 1 byte, the client requests a full 1MB chunk from the DN. To read 2 bytes 
whose offsets straddle a 1MB boundary, the client fetches the first 1MB chunk 
for the first byte and the second 1MB chunk for the second byte. In random 
read mode, TestDFSIO uses a read buffer of 1000000 bytes (976.5KB), so nearly 
every read straddles a 1MB boundary, which is why the total bytes read from 
the DNs is double the data size.
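The alignment behavior described above can be sketched with a little arithmetic. This is a hypothetical model written for this issue, not actual Ozone client code:

```python
def bytes_fetched(offset, length, bytes_per_checksum):
    """Bytes actually read from the DN for a request [offset, offset+length),
    assuming the client aligns both ends to bytes_per_checksum boundaries."""
    # round the start down to the previous checksum boundary
    start = (offset // bytes_per_checksum) * bytes_per_checksum
    # round the end up to the next checksum boundary
    end = offset + length
    aligned_end = -(-end // bytes_per_checksum) * bytes_per_checksum
    return aligned_end - start

buf = 1_000_000          # TestDFSIO random read buffer (~976.5KB)
one_mb = 1 << 20

# Reading 1 byte still fetches one full 1MB chunk:
print(bytes_fetched(0, 1, one_mb))            # 1048576 (1MB)

# A buffer-sized read starting mid-chunk straddles a 1MB boundary,
# so the client fetches two full 1MB chunks -> ~2x amplification:
print(bytes_fetched(500_000, buf, one_mb))    # 2097152 (2MB)

# With 16KB chunks the overhead is at most two partial 16KB chunks:
print(bytes_fetched(500_000, buf, 16 * 1024)) # 1015808 (~992KB)
```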

For comparison, HDFS uses the property "file.bytes-per-checksum", which 
defaults to 512 bytes.

To improve Ozone random read performance, a straightforward idea is to use a 
smaller "ozone.client.bytes.per.checksum" default value. We tested 1MB, 16KB 
and 8KB, gathering the data with TestDFSIO (64 files, 512MB each):

 
||ozone.client.bytes.per.checksum||write1(s)||write2(s)||write3(s)||read1(s)||read2(s)||read3(s)||read average(s)||random read1(s)||random read2(s)||random read3(s)||random read average(s)||
|1MB|163.01|163.34|141.9|47.25|51.86|52.02|50.28|114.42|90.38|97.83|100.88|
|16KB|160.6|144.43|165.08|63.36|67.68|69.94|66.89|55.94|72.14|55.43|61.17|
|8KB|149.97|161.01|161.57|66.46|61.61|63.17|63.75|62.06|71.93|58.56|64.18|

 

From the above data, we can see that for the same amount of data:
 * write: execution times show no obvious difference across the three cases
 * sequential read: 1MB bytes.per.checksum has the best execution time; 16KB 
and 8KB are close to each other
 * random read: 1MB has the worst execution time; 16KB and 8KB are close to 
each other
 * with either 16KB or 8KB bytes.per.checksum, sequential read and random read 
execution times are close, similar to HDFS behavior

Changing bytes.per.checksum from 1MB to 16KB costs a little sequential read 
performance, but the gain in random read performance is much larger.

So this task proposes to change the ozone.client.bytes.per.checksum default 
value from 1MB to 16KB, and to lower the property's minimum limit from 16KB 
to 8KB, to improve the overall read performance.
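For clusters that want the new value before the default changes, the property can be overridden on the client side. A sketch of the configuration, with the property name taken from this issue (whether a unit suffix like "16KB" is accepted depends on the property's size parsing):

```xml
<!-- ozone-site.xml (client side): override the checksum chunk size -->
<property>
  <name>ozone.client.bytes.per.checksum</name>
  <value>16KB</value>
</property>
```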



> Change ozone.client.bytes.per.checksum default to 16KB
> ------------------------------------------------------
>
>                 Key: HDDS-10465
>                 URL: https://issues.apache.org/jira/browse/HDDS-10465
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Major
>              Labels: pull-request-available
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
