[ 
https://issues.apache.org/jira/browse/HBASE-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534042#comment-17534042
 ] 

Tak-Lon (Stephen) Wu edited comment on HBASE-27013 at 5/9/22 10:24 PM:
-----------------------------------------------------------------------

bq. we can not reuse the stream to send multiple pread requests with random 
offset

The concept of reusing the same stream is about how far it reads ahead (the 
readahead range) from a single/current HTTP call to the object store, e.g. S3. 
If a seek/pread asks for a range that has already been read ahead in the HTTP 
response, we don't need to reopen a new HTTP connection to keep the data 
streaming. In other words, it's a different type of streaming implementation, 
based on an HTTP connection to the blob storage. The problem with this 
prefetch is that, if we're using 
{{fs.s3a.experimental.input.fadvise=sequential}} and we have read a lot of 
data from the remote store into a local buffer, we don't want to completely 
drain and abort the connection. (Meanwhile, we know that 
{{fs.s3a.experimental.input.fadvise=random}} can read small amounts of data 
into the buffer one at a time, but it's a lot slower.)
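As a rough illustration of the readahead reuse described above, here is a 
hypothetical sketch (not the actual S3AInputStream code; class and field names 
are made up) of the decision whether a pread can be served from the HTTP 
stream that is already open:

```java
// Hypothetical sketch of the readahead-range check described above; the real
// logic lives in S3AInputStream, this only models the decision.
public class ReadaheadSketch {
    long streamPos;  // next byte the open HTTP stream will return
    long rangeEnd;   // exclusive end of the ranged GET (streamPos + readahead)

    ReadaheadSketch(long streamPos, long readahead) {
        this.streamPos = streamPos;
        this.rangeEnd = streamPos + readahead;
    }

    /**
     * A forward seek that stays inside the ranged GET can be served by
     * draining bytes from the same HTTP connection; anything else forces a
     * close/abort and a reopen at the new offset.
     */
    boolean canReuseStream(long targetPos) {
        return targetPos >= streamPos && targetPos < rangeEnd;
    }

    public static void main(String[] args) {
        ReadaheadSketch s = new ReadaheadSketch(1000, 64 * 1024); // 64 KB readahead
        System.out.println(s.canReuseStream(2000));             // inside the range -> true
        System.out.println(s.canReuseStream(500));              // backward seek -> false
        System.out.println(s.canReuseStream(1000 + 64 * 1024)); // past the range -> false
    }
}
```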

bq. Seems not like a good enough pread implementation

I would say we're applying the HDFS semantics to blob storage like S3A, so 
that we're doing something interesting for any supported blob storage. HDFS, 
as Josh also pointed out, is just a lot faster than any file system 
implementation written for blob storage.

bq.  FSDataInputStream may be used by different read requests so even if you 
fixed this problem, it could still introduce a lot of aborts as different read 
request may read from different offsets...

So, it won't introduce other aborts when reading other offsets, because the 
problem we're facing in this JIRA is only for prefetch, and I believe I have 
proved that in my prototype. From a technical point of view, it's the other 
way around: today, without my change, we are the ones closing and aborting.

The improvement in this JIRA is only about a store open, or prefetch during 
store open. The actual usage is to customize the prefetch (via the Store File 
Manager) with the proposed configuration ({{hfile.pread.all.bytes.enabled}}) 
while the store is opening, using this optional read-all-bytes feature. (But 
we don't provide this store file manager out of the box, because the option 
is disabled by default.)
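For reference, enabling the optional read-all-bytes behavior described above 
would look like the following hbase-site.xml fragment (the property name is 
the one proposed in this JIRA; it defaults to false):

```xml
<property>
  <name>hfile.pread.all.bytes.enabled</name>
  <value>true</value>
</property>
```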

To sum up, if we're introducing a lot of aborts, then I think our 
implementation isn't right, but I still haven't found a case where we can 
introduce an abort if we're reading the extra header that is part of the data 
block of the HFile.
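A minimal sketch of the read-all-bytes idea (assuming a generic 
positioned-read API; this is not the actual BlockIOUtils code, and the fake 
pread source is invented for illustration): instead of returning after a 
short read, keep issuing preads until the data block plus the next block 
header have been filled, so no second positioned read of the header is needed:

```java
import java.util.Arrays;

// Sketch of the "read all bytes" loop described above, against a fake
// positioned-read source that deliberately returns short reads.
public class ReadAllBytesSketch {
    static final byte[] SOURCE = new byte[4096];
    static { Arrays.fill(SOURCE, (byte) 7); }

    /** Fake pread: returns at most 100 bytes per call to simulate short reads. */
    static int pread(long pos, byte[] buf, int off, int len) {
        if (pos >= SOURCE.length) return -1;  // EOF
        int n = Math.min(Math.min(len, 100), SOURCE.length - (int) pos);
        System.arraycopy(SOURCE, (int) pos, buf, off, n);
        return n;
    }

    /**
     * Keep reading until dataLen + headerLen bytes are buffered, so the next
     * block header never has to be fetched with a second positioned read.
     */
    static int readFully(long pos, byte[] buf, int dataLen, int headerLen) {
        int total = 0, want = dataLen + headerLen;
        while (total < want) {
            int n = pread(pos + total, buf, total, want - total);
            if (n < 0) break;  // hit EOF; caller decides if that is an error
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[1024 + 33];
        int got = readFully(0, buf, 1024, 33);  // 33 = next block header size
        System.out.println(got);                // 1057
    }
}
```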



> Introduce read all bytes when using pread for prefetch
> ------------------------------------------------------
>
>                 Key: HBASE-27013
>                 URL: https://issues.apache.org/jira/browse/HBASE-27013
>             Project: HBase
>          Issue Type: Improvement
>          Components: HFile, Performance
>    Affects Versions: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.13
>            Reporter: Tak-Lon (Stephen) Wu
>            Assignee: Tak-Lon (Stephen) Wu
>            Priority: Major
>
> h2. Problem statement
> When prefetching HFiles from blob storage such as S3 through a storage 
> implementation like S3A, we found a logical issue in HBase pread that 
> causes reading a remote HFile to abort the input stream multiple times. 
> Aborting and reopening the stream slows down reads, discards many aborted 
> bytes, and wastes time recreating the connection, especially when SSL is 
> enabled.
> h2. ROOT CAUSE
> The root cause of the above issue is that 
> [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
>  reads from an input stream that does not guarantee to return the data 
> block plus the next block header as optional data to be cached.
> When the input stream reads short, and the read passes the length of the 
> necessary data block by a few bytes but fewer than the size of the next 
> block header, 
> [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
>  returns to the caller without caching the next block header. As a result, 
> before HBase reads the next block, 
> [HFileBlock#readBlockDataInternal|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1648-L1664]
>  tries to re-read the next block header from the input stream. At this 
> point the reusable input stream has moved its current position pointer past 
> the offset of the last read data block; with the [S3A 
> implementation|https://github.com/apache/hadoop/blob/29401c820377d02a992eecde51083cf87f8e57af/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L339-L361],
>  the input stream is then closed, aborting all the remaining bytes, and a 
> new input stream is reopened at the offset of the last read data block.
> h2. How do we fix it?
> S3A is doing the right job: HBase tells it to move the offset from position 
> A back to A - N, so there is not much we can do about how S3A handles the 
> input stream; meanwhile, in the case of HDFS, this operation is fast.
> Therefore, we should fix this at the HBase level and always try to read the 
> data block + next block header when we're using blob storage, to avoid the 
> expensive draining of bytes from a stream and reopening the socket to the 
> remote storage.
> h2. Drawback and discussion
>  * A known drawback is that, when we're at the last block, we will read an 
> extra length that is not actually a header, and we still read it into the 
> byte buffer array. The size should always be 33 bytes, and it should not be 
> a big issue for data correctness because the trailer tells where the last 
> data block ends. We just waste a 33-byte read, and that data is not used.
>  * I don't know if we can use HFileStreamReader, but that would change the 
> prefetch logic a lot, so this minimal change should be the best option.
> h2. Initial result
> We used a YCSB dataset of 1 billion records, and we enabled prefetch for 
> the user table. We collected the S3A metric 
> {{stream_read_bytes_discarded_in_abort}} to compare the solutions; each 
> region server had to prefetch ~290 GB of data into the bucket cache.
> * Before the change, a total of 4235973338472 bytes (~4235 GB) were aborted 
> on a sample region server for about 290 GB of data.
> ** The overall time was about 45-60 minutes.
>  
> {code}
> % grep "stream_read_bytes_discarded_in_abort" 
> ~/prefetch-result/prefetch-s3a-jmx-metrics.json | grep -wv 
> "stream_read_bytes_discarded_in_abort\":0,"
>          "stream_read_bytes_discarded_in_abort":3136854553,
>          "stream_read_bytes_discarded_in_abort":19119241,
>          "stream_read_bytes_discarded_in_abort":2131591701471,
>          "stream_read_bytes_discarded_in_abort":150484654298,
>          "stream_read_bytes_discarded_in_abort":106536641550,
>          "stream_read_bytes_discarded_in_abort":1785264521717,
>          "stream_read_bytes_discarded_in_abort":58939845642,
> {code}
> * After the change, only 87100225454 bytes (~87 GB) of data were aborted.
> ** The reason is that the position falls way behind the requested target 
> position, so S3A reopens the stream and moves the position to the current 
> offset. This is a different problem we will need to look into later.
> ** The overall time was then cut to 30-38 minutes, about 30% faster.
> {code}
> % grep "stream_read_bytes_discarded_in_abort" ~/fixed-formatted-jmx2.json
>       "stream_read_bytes_discarded_in_abort": 0,
>       "stream_read_bytes_discarded_in_abort": 87100225454,
>       "stream_read_bytes_discarded_in_abort": 67043088,
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
