[ https://issues.apache.org/jira/browse/HBASE-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533995#comment-17533995 ]
Josh Elser commented on HBASE-27013:
------------------------------------

{quote}So the problem here is, the implementation of S3A is not HDFS, we can not reuse the stream to send multiple pread requests with random offset. Seems not like a good enough pread implementation...
{quote}
Yeah, s3a != hdfs is definitely a major pain point. IIUC, neither HBase nor HDFS is doing anything wrong, per se. HDFS just happens to handle this super fast and s3a... doesn't.
{quote}In general, in pread mode, a FSDataInputStream may be used by different read requests so even if you fixed this problem, it could still introduce a lot of aborts as different read request may read from different offsets...
{quote}
Right again. The focus is on prefetching because we know that once HFiles are cached, things are super fast, so this is the first problem to chase. However, any operation over a table which isn't fully cached would still end up over-reading from S3. I had thought about whether we just write a custom Reader for the prefetch case, but then we wouldn't address the rest of the access paths (e.g. scans). Stephen's worst-case numbers are still ~130 MB/s to pull HFiles down from S3 to the cache, which looks good on the surface, but not so good when you compare it to the closer to 1 GB/s you can get through the awscli (and whatever their parallelized downloader was called). One optimization at a time :)

> Introduce read all bytes when using pread for prefetch
> ------------------------------------------------------
>
>                 Key: HBASE-27013
>                 URL: https://issues.apache.org/jira/browse/HBASE-27013
>             Project: HBase
>          Issue Type: Improvement
>          Components: HFile, Performance
>    Affects Versions: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.13
>            Reporter: Tak-Lon (Stephen) Wu
>            Assignee: Tak-Lon (Stephen) Wu
>            Priority: Major
>
> h2. Problem statement
> When prefetching HFiles from blob storage such as S3 through a filesystem implementation like S3A, there is a logical issue in HBase pread that causes reads of the remote HFile to abort the input stream multiple times. The abort-and-reopen cycle slows down reads, discards many aborted bytes, and wastes time recreating the connection, especially when SSL is enabled.
> h2. Root cause
> The root cause is that [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257] reads from an input stream that is not guaranteed to return the data block plus the next block header (the optional extra data to be cached).
> When the input stream reads short, i.e. the read goes past the length of the necessary data block but ends only a few bytes into the next block header, [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257] returns to the caller without having cached the next block header. As a result, before HBase reads the next block, [HFileBlock#readBlockDataInternal|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1648-L1664] tries to re-read the next block header from the input stream.
> Here, the reusable input stream has already moved its current position pointer ahead of the offset of the last read data block. With the [S3A implementation|https://github.com/apache/hadoop/blob/29401c820377d02a992eecde51083cf87f8e57af/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L339-L361], seeking backwards means the input stream is closed, all remaining bytes are aborted, and a new input stream is reopened at the offset of the last read data block.
> h2. How do we fix it?
> S3A is doing the right thing when HBase tells it to move the offset from position A back to A - N, so there is not much we can do about how S3A handles the input stream; meanwhile, in the case of HDFS, this operation is fast. Therefore we should fix this at the HBase level, and always try to read the data block plus the next block header when we're on blob storage, to avoid the expensive draining of the bytes in a stream and reopening of the socket to the remote storage.
> h2. Drawback and discussion
> * A known drawback is that when we're at the last block, we will read an extra length that is not actually a header, and we still read it into the byte buffer. The size is always 33 bytes, and it is not a data-correctness issue because the trailer tells us where the last data block ends; we just waste a 33-byte read whose data is never used.
> * I don't know if we could use HFileStreamReader instead, but that would change the prefetch logic a lot, so this minimal change should be the best option.
> h2. Initial result
> We loaded 1 billion records with YCSB and enabled prefetch for the user table, then collected the S3A metric {{stream_read_bytes_discarded_in_abort}} to compare solutions; each region server had ~290 GB of data to prefetch into the bucket cache.
> * Before the change, a total of 4235973338472 bytes (~4235 GB) were aborted on a sample region server for about 290 GB of data.
> ** The overall time was about 45~60 mins.
>
> {code}
> % grep "stream_read_bytes_discarded_in_abort" ~/prefetch-result/prefetch-s3a-jmx-metrics.json | grep -wv "stream_read_bytes_discarded_in_abort\":0,"
> "stream_read_bytes_discarded_in_abort":3136854553,
> "stream_read_bytes_discarded_in_abort":19119241,
> "stream_read_bytes_discarded_in_abort":2131591701471,
> "stream_read_bytes_discarded_in_abort":150484654298,
> "stream_read_bytes_discarded_in_abort":106536641550,
> "stream_read_bytes_discarded_in_abort":1785264521717,
> "stream_read_bytes_discarded_in_abort":58939845642,
> {code}
> * After the change, only 87100225454 bytes (~87 GB) were aborted.
> ** The remaining aborts happen because the position is way behind the requested target position, so S3A reopens the stream and moves the position to the current offset. This is a different problem we will need to look into later.
> ** The overall time is cut to 30~38 mins, about 30% faster.
> {code}
> % grep "stream_read_bytes_discarded_in_abort" ~/fixed-formatted-jmx2.json
> "stream_read_bytes_discarded_in_abort": 0,
> "stream_read_bytes_discarded_in_abort": 87100225454,
> "stream_read_bytes_discarded_in_abort": 67043088,
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
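The "read the data block plus the next block header" fix described above can be sketched as a loop that keeps issuing positional reads until both the block and the 33-byte next header are buffered, tolerating EOF only once the necessary bytes are in. This is an illustrative sketch, not HBase's actual BlockIOUtils code; the PositionedStream interface and the preadAllBytes/shortReader names are invented for the example.

```java
import java.io.IOException;

public class PreadAllBytesSketch {

  // Minimal stand-in for a positional-read stream that may read short,
  // in the style of FSDataInputStream#read(long, byte[], int, int).
  interface PositionedStream {
    int read(long position, byte[] buf, int offset, int length) throws IOException;
  }

  /**
   * Read necessaryLen + extraLen bytes starting at position, looping on short
   * reads instead of returning early. Returns true only when the extra bytes
   * (the next block header) were also read.
   */
  static boolean preadAllBytes(PositionedStream in, byte[] buf, long position,
      int necessaryLen, int extraLen) throws IOException {
    int remain = necessaryLen + extraLen;
    int bufOffset = 0;
    while (remain > 0) {
      int count = in.read(position, buf, bufOffset, remain);
      if (count < 0) {
        // EOF is fine once all necessary bytes are read: the last block of a
        // file has no following header (the 33-byte over-read noted above).
        if (remain <= extraLen) {
          break;
        }
        throw new IOException("Premature EOF: " + (remain - extraLen)
            + " necessary bytes still missing");
      }
      position += count;
      bufOffset += count;
      remain -= count;
    }
    return remain == 0;
  }

  // A stream over an in-memory "file" that returns at most 7 bytes per call,
  // to simulate short reads from a remote store.
  static PositionedStream shortReader(byte[] file) {
    return (pos, buf, off, len) -> {
      if (pos >= file.length) {
        return -1;
      }
      int n = Math.min(7, Math.min(len, file.length - (int) pos));
      System.arraycopy(file, (int) pos, buf, off, n);
      return n;
    };
  }

  public static void main(String[] args) throws IOException {
    // Mid-file: a 64-byte block plus its 33-byte next header both available.
    boolean midFile = preadAllBytes(shortReader(new byte[200]), new byte[97], 0, 64, 33);
    System.out.println("next header cached: " + midFile);
    // Last block: the file ends before a full next header exists.
    boolean lastBlock = preadAllBytes(shortReader(new byte[80]), new byte[97], 0, 64, 33);
    System.out.println("next header cached: " + lastBlock);
  }
}
```

Under this sketch, the caller only falls back to a separate header read (the path that forced S3A to seek backwards and abort the stream) when preadAllBytes returns false, which by construction can only happen at the last block of the file.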