[
https://issues.apache.org/jira/browse/HBASE-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023874#comment-18023874
]
Steve Loughran commented on HBASE-27013:
----------------------------------------
fs.s3a.experimental.fadvise = sequential isn't the best policy here.
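For reference, the S3A input policy is normally set through the Hadoop
configuration key fs.s3a.experimental.input.fadvise (values sequential,
random, normal). A minimal sketch of switching an HBase-on-S3A deployment to
the random policy, which keeps each GET close to the requested pread range;
the bucket name and the programmatic style are illustrative only:
{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class S3AFadviseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "random" keeps each GET close to the requested pread range, so a
    // backwards seek does not have to drain/abort a long sequential stream.
    conf.set("fs.s3a.experimental.input.fadvise", "random");
    // "my-bucket" is a placeholder; in practice this is hbase.rootdir's bucket.
    FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
    // ... open HFiles through this FileSystem as usual
  }
}
{code}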
> Introduce read all bytes when using pread for prefetch
> ------------------------------------------------------
>
> Key: HBASE-27013
> URL: https://issues.apache.org/jira/browse/HBASE-27013
> Project: HBase
> Issue Type: Improvement
> Components: HFile, Performance
> Affects Versions: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.13
> Reporter: Tak-Lon (Stephen) Wu
> Assignee: Tak-Lon (Stephen) Wu
> Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> h2. Problem statement
> When prefetching HFiles from blob storage such as S3 through a filesystem
> implementation like S3A, we found a logical issue in the HBase pread path
> that causes reads of the remote HFile to abort the input stream multiple
> times. Every abort discards the remaining bytes and forces the stream to be
> reopened, which slows down reads and wastes time re-establishing the
> connection, especially when SSL is enabled.
> h2. ROOT CAUSE
> The root cause is that
> [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
> reads from an input stream that does not guarantee to return both the data
> block and the next block header (the optional extra bytes to be cached).
> When the stream returns a short read that covers the requested data block
> plus only a few of the bytes of the next block header,
> [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
> returns to the caller without having cached the next block header. As a
> result, before HBase reads the next block,
> [HFileBlock#readBlockDataInternal|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1648-L1664]
> has to re-read the next block header from the input stream. At that point
> the reusable input stream has already moved its position pointer ahead of
> the offset of the last read data block, so with the [S3A
> implementation|https://github.com/apache/hadoop/blob/29401c820377d02a992eecde51083cf87f8e57af/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L339-L361]
> the backwards seek closes the stream, aborts all the remaining bytes, and
> reopens a new input stream at the offset of the last read data block.
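> The behaviour can be pictured with a condensed sketch of the pread loop
> (names and signatures are trimmed down from the real BlockIOUtils code;
> this is an illustration of the logic, not the actual implementation):
> {code}
> import java.io.IOException;
> import org.apache.hadoop.fs.FSDataInputStream;
>
> final class PreadSketch {
>   /**
>    * The loop only insists on necessaryLen bytes (the data block); the
>    * extraLen bytes (the next block header) end up in the buffer only if the
>    * positional reads happen to return them, so a short read that stops
>    * inside the header leaves the header uncached.
>    */
>   static boolean preadWithExtraSketch(byte[] buf, FSDataInputStream in,
>       long position, int necessaryLen, int extraLen) throws IOException {
>     int bytesRead = 0;
>     while (bytesRead < necessaryLen) {
>       int toRead = necessaryLen + extraLen - bytesRead;
>       int ret = in.read(position + bytesRead, buf, bytesRead, toRead);
>       if (ret < 0) {
>         throw new IOException("Premature EOF at position " + position);
>       }
>       bytesRead += ret;
>     }
>     // true only when the next block header was read as well
>     return bytesRead >= necessaryLen + extraLen;
>   }
> }
> {code}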
> h2. How do we fix it?
> S3A is doing the right thing here: HBase asks it to move the offset from
> position A back to A - N, so there is not much we can do about how S3A
> handles the input stream (in the HDFS case the same operation is cheap).
> Instead we should fix this at the HBase level and, when running against
> blob storage, always try to read the data block plus the next block header,
> so that we avoid the expensive draining of the stream and the reopening of
> the socket to the remote storage.
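> In sketch form, the change amounts to extending the loop above so that, when
> a "read all bytes" mode is enabled for blob storage, it does not return until
> the extra header bytes are also in the buffer (the flag name and wiring here
> are only illustrative, not the exact patch):
> {code}
> import java.io.IOException;
> import org.apache.hadoop.fs.FSDataInputStream;
>
> final class PreadAllBytesSketch {
>   /**
>    * When readAllBytes is true (intended for blob stores such as S3A), keep
>    * issuing positional reads until the data block AND the next block header
>    * are both in the buffer, so the caller never needs to seek the stream
>    * backwards to fetch the header.
>    */
>   static boolean preadWithExtraSketch(byte[] buf, FSDataInputStream in,
>       long position, int necessaryLen, int extraLen, boolean readAllBytes)
>       throws IOException {
>     int required = readAllBytes ? necessaryLen + extraLen : necessaryLen;
>     int bytesRead = 0;
>     while (bytesRead < required) {
>       int toRead = necessaryLen + extraLen - bytesRead;
>       int ret = in.read(position + bytesRead, buf, bytesRead, toRead);
>       if (ret < 0) {
>         throw new IOException("Premature EOF at position " + position);
>       }
>       bytesRead += ret;
>     }
>     return bytesRead >= necessaryLen + extraLen;
>   }
> }
> {code}
> On HDFS the flag would stay off, so the existing short-read behaviour is
> unchanged there.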
> h2. Drawbacks and discussion
> * A known drawback is that at the last block we read an extra length that is
> not actually a header, and we still read it into the byte buffer. The extra
> read is always the header size of 33 bytes (see the breakdown after this
> list) and it does not affect correctness, because the trailer tells us where
> the last data block ends; we simply waste a 33-byte read whose data is never
> used.
> * I don't know whether we could use HFileStreamReader instead, but that
> would change the prefetch logic a lot, so this minimal change should be the
> best option.
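> (For reference, the 33 bytes match the checksum-enabled HFileBlock header
> layout as I read it: 8-byte block magic + 4-byte onDiskSizeWithoutHeader +
> 4-byte uncompressedSizeWithoutHeader + 8-byte prevBlockOffset + 1-byte
> checksumType + 4-byte bytesPerChecksum + 4-byte onDiskDataSizeWithHeader =
> 33 bytes, i.e. HConstants.HFILEBLOCK_HEADER_SIZE.)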
> h2. Initial result
> We loaded a YCSB dataset of 1 billion records, enabled prefetch on the
> usertable, and collected the S3A metric
> {{stream_read_bytes_discarded_in_abort}} to compare the solution; each
> region server had to prefetch about 290 GB of data into the bucket cache.
> * Before the change, a total of 4235973338472 bytes (~4235 GB) was aborted
> on a sample region server while prefetching about 290 GB of data.
> ** The overall time was about 45~60 mins.
>
> {code}
> % grep "stream_read_bytes_discarded_in_abort" ~/prefetch-result/prefetch-s3a-jmx-metrics.json | grep -wv "stream_read_bytes_discarded_in_abort\":0,"
> "stream_read_bytes_discarded_in_abort":3136854553,
> "stream_read_bytes_discarded_in_abort":19119241,
> "stream_read_bytes_discarded_in_abort":2131591701471,
> "stream_read_bytes_discarded_in_abort":150484654298,
> "stream_read_bytes_discarded_in_abort":106536641550,
> "stream_read_bytes_discarded_in_abort":1785264521717,
> "stream_read_bytes_discarded_in_abort":58939845642,
> {code}
> * After the change, only 87100225454 bytes (~87 GB) were aborted.
> ** The remaining aborts happen when the current position is far behind the
> requested target position, so S3A reopens the stream and moves to that
> offset. This is a different problem that we will need to look into later.
> ** The overall time was cut to 30~38 mins, about 30% faster.
> {code}
> % grep "stream_read_bytes_discarded_in_abort" ~/fixed-formatted-jmx2.json
> "stream_read_bytes_discarded_in_abort": 0,
> "stream_read_bytes_discarded_in_abort": 87100225454,
> "stream_read_bytes_discarded_in_abort": 67043088,
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)