Constantin-Catalin Luca created HBASE-24541:
-----------------------------------------------
Summary: Add support to run LoadIncrementalHFiles in a distributed manner
Key: HBASE-24541
URL: https://issues.apache.org/jira/browse/HBASE-24541
Project: HBase
Issue Type: Improvement
Components: mapreduce, Performance
Affects Versions: 1.4.0
Reporter: Constantin-Catalin Luca
LoadIncrementalHFiles takes a very long time to complete when running HBase on
top of S3 and attempting to bulkload 500K-700K files.
The root cause is a combination of the higher latency of S3 (compared to
HDFS) and the calls that LoadIncrementalHFiles makes to the underlying
filesystem: each file is opened, a seek is made to the trailer offset at the
end, and then the trailer is read.
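The per-file access pattern described above can be sketched roughly as follows. This is an illustration on a local file only, not the HBase implementation: the class name, the method name, and the 16-byte TRAILER_SIZE are assumptions made for the example and do not reflect the real HFile trailer layout. On S3, each of these steps would presumably turn into one or more remote requests.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Illustrative only: mimics the open / seek / read-trailer pattern on a
// local file. All names and the trailer size are hypothetical.
class TrailerReadSketch {
    static final int TRAILER_SIZE = 16; // made-up value, not the HFile format

    static byte[] readTrailer(Path file) throws IOException {
        // Step 1: open the file.
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            long length = in.length();
            if (length < TRAILER_SIZE) {
                throw new IOException("file shorter than trailer: " + file);
            }
            // Step 2: seek to the trailer offset at the end of the file.
            in.seek(length - TRAILER_SIZE);
            // Step 3: read the trailer bytes.
            byte[] trailer = new byte[TRAILER_SIZE];
            in.readFully(trailer);
            return trailer;
        }
    }
}
```

With 500K-700K files, this sequence is repeated once per file, so any fixed per-request latency is paid that many times.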
Increasing the parallelism does not yield any significant improvement. This
seems to be because the stream is not consumed to the end once the trailer
has been read, so the underlying HTTP connection is aborted and cannot be
reused.
The proposed solution is to also support running LoadIncrementalHFiles on
multiple machines as a MapReduce job.
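One way such a job might divide the work is to split the list of HFile paths into roughly equal chunks, one per map task. The sketch below shows only that partitioning step using round-robin assignment. The class and method names are hypothetical, and it assumes nothing about how the issue would actually be implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: round-robin assignment of HFile paths to map tasks.
// Not part of the HBase or Hadoop API; all names are hypothetical.
class HFileSplitSketch {
    static List<List<String>> partition(List<String> hfilePaths, int numMappers) {
        if (numMappers <= 0) {
            throw new IllegalArgumentException("numMappers must be positive");
        }
        // One (initially empty) chunk per map task.
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < numMappers; i++) {
            splits.add(new ArrayList<>());
        }
        // Path i goes to chunk i mod numMappers, so chunk sizes differ by at most one.
        for (int i = 0; i < hfilePaths.size(); i++) {
            splits.get(i % numMappers).add(hfilePaths.get(i));
        }
        return splits;
    }
}
```

Each map task would then perform the per-file open, seek, and trailer read for its own chunk, spreading the S3 round trips across machines rather than across threads on a single host.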
--
This message was sent by Atlassian Jira
(v8.3.4#803005)