[ https://issues.apache.org/jira/browse/HBASE-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Constantin-Catalin Luca updated HBASE-24541:
--------------------------------------------
Attachment: HBASE_24541-1.4.0.patch
Status: Patch Available (was: Open)
> Add support to run LoadIncrementalHFiles in a distributed manner
> ----------------------------------------------------------------
>
> Key: HBASE-24541
> URL: https://issues.apache.org/jira/browse/HBASE-24541
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce, Performance
> Affects Versions: 1.4.0
> Reporter: Constantin-Catalin Luca
> Priority: Minor
> Attachments: HBASE_24541-1.4.0.patch
>
>
> LoadIncrementalHFiles takes a very long time to complete when running HBase
> on top of S3 and attempting to bulk load 500K-700K files.
> The root cause is a combination of the higher latency of S3 (compared to
> HDFS) and the calls LoadIncrementalHFiles makes to the underlying
> filesystem: each file is opened, the stream seeks to the trailer offset at
> the end of the file, and then the trailer is read.
> Increasing the parallelism does not yield any significant improvement. This
> seems to be because the stream is not consumed to the end once the trailer
> has been read, so the underlying HTTP connection is aborted and cannot be
> reused.
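
The per-file access pattern described above (open, seek to the trailer offset, read the trailer) can be sketched roughly as follows. This is only an illustration on a local file using java.nio, not the HBase HFile reader and not an S3 client; the class name TrailerReadSketch and the 16-byte TRAILER_SIZE are assumptions made up for the example. On S3 each of these steps would turn into remote requests, which is where the latency adds up over hundreds of thousands of files.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class TrailerReadSketch {
    // Hypothetical fixed trailer size for illustration; not the real HFile trailer size.
    static final int TRAILER_SIZE = 16;

    // Opens the file, positions at (size - TRAILER_SIZE), and reads only the trailer bytes.
    static byte[] readTrailer(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size < TRAILER_SIZE) {
                throw new IOException("file shorter than trailer: " + file);
            }
            long trailerOffset = size - TRAILER_SIZE;
            ByteBuffer buf = ByteBuffer.allocate(TRAILER_SIZE);
            // Positional reads may return fewer bytes than requested, so loop until full.
            while (buf.hasRemaining()) {
                int n = ch.read(buf, trailerOffset + buf.position());
                if (n < 0) {
                    throw new IOException("unexpected end of file: " + file);
                }
            }
            return buf.array();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an HFile: a temporary local file filled with zero bytes.
        Path tmp = Files.createTempFile("hfile-sketch", ".bin");
        Files.write(tmp, new byte[1024]);
        System.out.println("trailer bytes read: " + readTrailer(tmp).length);
        Files.delete(tmp);
    }
}
```

Note that the rest of the file is never read here, which mirrors the situation in the description where the stream is abandoned after the trailer.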
>
> The proposed solution is to add support for running LoadIncrementalHFiles
> on multiple machines as a MapReduce job.
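
One way to picture the distributed approach is to split the list of HFile paths into groups and hand each group to a separate task. The sketch below is plain Java and does not use the Hadoop MapReduce API or reflect how the attached patch is implemented; the class name BulkLoadSplitSketch, the round-robin strategy, and the s3://example-bucket paths are all assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class BulkLoadSplitSketch {
    // Distributes paths round-robin into numSplits groups,
    // one group per (hypothetical) map task.
    static List<List<String>> partition(List<String> hfilePaths, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new ArrayList<>());
        }
        for (int i = 0; i < hfilePaths.size(); i++) {
            splits.get(i % numSplits).add(hfilePaths.get(i));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up paths standing in for the staged HFiles.
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            paths.add("s3://example-bucket/staging/cf/hfile-" + i);
        }
        List<List<String>> splits = partition(paths, 3);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("task " + i + " -> " + splits.get(i).size() + " files");
        }
    }
}
```

With the file list divided this way, each task would open and inspect only its own share of the files, so the per-file filesystem latency is spread across machines rather than paid by a single client.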