[GitHub] [hbase] catalin-luca opened a new pull request #2002: HBASE-24541 Add support to run LoadIncrementalHFiles in a distributed manner

GitBox Tue, 30 Jun 2020 07:40:58 -0700


catalin-luca opened a new pull request #2002:
URL: https://github.com/apache/hbase/pull/2002



   LoadIncrementalHFiles takes a very long time to complete when running HBase 
on top of S3 and attempting to bulkload 500K-700K files.
   The root cause is a combination of the higher latency of S3 (compared to 
HDFS) and the calls LoadIncrementalHFiles makes to the underlying filesystem: 
each file is opened, a seek is made to the trailer offset at the end, and then 
the trailer is read.
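
   As a rough illustration of that per-file access pattern (open, seek to the 
trailer offset, read the trailer), here is a minimal sketch. It is not the 
HBase code: it uses a local file via java.io.RandomAccessFile as a stand-in 
for the S3-backed filesystem, and the class name, method name and TRAILER_SIZE 
are hypothetical values chosen only for the example. On S3 each of these steps 
costs at least one remote round trip per file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class TrailerReadSketch {
    // Hypothetical fixed trailer size, for illustration only.
    // This is NOT the real HFile trailer layout.
    static final int TRAILER_SIZE = 16;

    // Opens the file, seeks to the trailer offset at the end, reads the trailer.
    public static byte[] readTrailer(Path file) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            long length = in.length();
            if (length < TRAILER_SIZE) {
                throw new IOException("File shorter than trailer: " + file);
            }
            in.seek(length - TRAILER_SIZE);      // jump to the trailer offset
            byte[] trailer = new byte[TRAILER_SIZE];
            in.readFully(trailer);               // read only the trailer bytes
            return trailer;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small local file so the sketch can run on its own.
        Path tmp = Files.createTempFile("trailer-sketch", ".bin");
        try {
            byte[] data = new byte[100];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) i;
            }
            Files.write(tmp, data);
            byte[] trailer = readTrailer(tmp);
            System.out.println("trailer bytes read: " + trailer.length);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```
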
   Increasing the parallelism does not yield any significant improvement. This 
seems to be because, once the trailer is read, the stream is not consumed to 
the end, so the underlying HTTP connection is aborted and cannot be reused.
    
   The proposed solution is to add support for running LoadIncrementalHFiles 
on multiple machines as a MapReduce job.
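
   To make the idea of spreading the work concrete, the sketch below splits a 
list of HFile paths into one group per worker, so that each machine would only 
open and inspect its own share of the files. This is an assumption about how 
such a split could look, not the implementation in this pull request: the 
class name, the partition method, the round-robin strategy and the example 
paths are all hypothetical, and no MapReduce API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HFilePartitionSketch {
    // Distributes paths round-robin into exactly `workers` groups.
    // Some groups are empty when there are fewer paths than workers.
    public static List<List<String>> partition(List<String> paths, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < paths.size(); i++) {
            groups.get(i % workers).add(paths.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical HFile paths, for illustration only.
        List<String> paths = Arrays.asList(
            "s3://bucket/cf/hfile-1", "s3://bucket/cf/hfile-2",
            "s3://bucket/cf/hfile-3", "s3://bucket/cf/hfile-4",
            "s3://bucket/cf/hfile-5");
        List<List<String>> groups = partition(paths, 2);
        for (int i = 0; i < groups.size(); i++) {
            System.out.println("worker " + i + ": " + groups.get(i));
        }
    }
}
```

   With 500K-700K files, the per-file latency is then paid concurrently on 
several machines instead of on a single client.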


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
