[ https://issues.apache.org/jira/browse/HBASE-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413293#comment-17413293 ]
Josh Elser commented on HBASE-26273:
------------------------------------
{quote}Can you explain why there are more HDFS connection with PREAD than
STREAM? Thanks.
{quote}
That was probably not a good way to phrase it on my part :). What I meant to
point out is that, if you're running MapReduce over snapshots, you are most
likely reading most or all of each HFile. The seek+read we do for every pread
seems excessive to me when we could instead just keep reading forward like normal.
This is also related to the other issue Stephen filed, HBASE-26274, where we
_did_ make a lot more connections to HDFS because we kept having to go back and
re-read the index blocks.
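
To make the seek+read versus read-forward distinction concrete, here is a rough sketch using only the JDK against a local file. It is an analogy, not HBase or HDFS code: the class name, method names, and the 64 KB block size are all made up for illustration. In HBase itself the choice is expressed on the scan, e.g. via Scan.setReadType(Scan.ReadType.STREAM).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Local-file analogy only; names and block size are hypothetical.
final class ReadPatternSketch {
    static final int BLOCK_SIZE = 64 * 1024; // assumed block size for the sketch

    // pread-style: every block is requested with an explicit offset.
    static long readAllPositional(Path file) throws IOException {
        long total = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(BLOCK_SIZE);
            long pos = 0;
            while (true) {
                buf.clear();
                // Positional read: does not use or advance the channel's own position.
                int n = ch.read(buf, pos);
                if (n < 0) {
                    break; // offset is at or past end of file
                }
                pos += n;
                total += n;
            }
        }
        return total;
    }

    // stream-style: open once and keep reading forward from the current position.
    static long readAllStreaming(Path file) throws IOException {
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[BLOCK_SIZE];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("read-sketch", ".bin");
        try {
            Files.write(tmp, new byte[300_000]);
            System.out.println("positional bytes: " + readAllPositional(tmp));
            System.out.println("streaming bytes:  " + readAllStreaming(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Both methods return the same byte count; the difference is only in the access pattern. When a job reads the whole file anyway, the per-block offset bookkeeping of the positional variant buys nothing, which is the point being made about snapshot scans above.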
> TableSnapshotInputFormat/TableSnapshotInputFormatImpl should use
> ReadType.STREAM for scanning HFiles
> -----------------------------------------------------------------------------------------------------
>
> Key: HBASE-26273
> URL: https://issues.apache.org/jira/browse/HBASE-26273
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 3.0.0-alpha-1, 2.4.6
> Reporter: Tak-Lon (Stephen) Wu
> Assignee: Josh Elser
> Priority: Major
>
> After the change in HBASE-17917, which uses PREAD ({{ReadType.DEFAULT}}) for all
> user scans, the behavior of TableSnapshotInputFormat changed from STREAM to
> PREAD.
> TableSnapshotInputFormat is supposed to be used with YARN/MR or another batch
> engine that reads the entire HFile in the container/executor. With the
> default always being PREAD, the number of connections to HDFS surges, which
> has a side effect on overall performance.
> The goal of this change is to make any downstream user of
> TableSnapshotInputFormat scan with STREAM.