[ 
https://issues.apache.org/jira/browse/HBASE-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428337#comment-13428337
 ] 

Todd Lipcon commented on HBASE-6358:
------------------------------------

The problem with doing the copy automatically in LoadIncrementalHFiles (i.e., 
the client) is that funneling any non-trivial amount of data through that 
single node is going to be very slow.

Here's an alternate idea:
1. In this JIRA, change the RS side to fail fast if the hfile's filesystem 
doesn't match HBase's own filesystem
2. Separately, add a new "DistributedLoadIncrementalHFiles" program which acts 
as a combination of distcp and LoadIncrementalHFiles. For each RS (or perhaps 
for each region), it would create one map task, with a locality hint to that 
server. Then the task would copy the relevant file (achieving a local replica) 
and make the necessary call to load the file.
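
To make step 1 concrete, here is a minimal sketch of the kind of check I have 
in mind. It is plain Java and does not use any HBase or Hadoop classes: the 
class name FsMatchSketch and both method names are hypothetical, and comparing 
scheme, host, and port is only an assumption about what "the filesystem 
matches" should mean (it ignores default ports and HA nameservice aliases).

```java
import java.net.URI;

// Hypothetical helper, not part of HBase: illustrates the step-1 check only.
public final class FsMatchSketch {
    private FsMatchSketch() {}

    /** Returns true when both URIs share the same scheme, host, and port. */
    public static boolean sameFileSystem(URI source, URI hbaseRoot) {
        return equalsIgnoreCase(source.getScheme(), hbaseRoot.getScheme())
            && equalsIgnoreCase(source.getHost(), hbaseRoot.getHost())
            && source.getPort() == hbaseRoot.getPort();
    }

    /** Null-safe, case-insensitive comparison (schemes and hosts ignore case). */
    private static boolean equalsIgnoreCase(String a, String b) {
        return a == null ? b == null : a.equalsIgnoreCase(b);
    }

    /**
     * Rejects the bulk load up front instead of starting a long
     * cross-filesystem copy that the client may time out on and retry.
     */
    public static void checkBulkLoadSource(URI source, URI hbaseRoot) {
        if (!sameFileSystem(source, hbaseRoot)) {
            throw new IllegalArgumentException(
                "hfile " + source + " is not on the HBase filesystem " + hbaseRoot
                + "; copy it over (e.g. with distcp) before bulk loading");
        }
    }
}
```

With a check like this on the RS side, a client pointing at a remote hfile gets 
an immediate error rather than tying up a handler thread on the copy.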

Until step 2 is done, users would have to run distcp themselves and sacrifice 
locality. But with the current scheme they already don't get locality in the 
common case where the MR job runs on the same cluster as HBase.

Thoughts?
                
> Bulkloading from remote filesystem is problematic
> -------------------------------------------------
>
>                 Key: HBASE-6358
>                 URL: https://issues.apache.org/jira/browse/HBASE-6358
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.94.0
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: 6358-suggestion.txt, HBASE-6358-trunk-v1.diff, 
> HBASE-6358-trunk-v2.diff, HBASE-6358-trunk-v3.diff
>
>
> Bulk loading hfiles that don't live on the same filesystem as HBase can cause 
> problems for subtle reasons.
> In Store.bulkLoadHFile(), the regionserver will copy the source hfile to its 
> own filesystem if it's not already there. Since this can take a long time for 
> large hfiles, it's likely that the client will time out and retry. When the 
> client retries repeatedly, there may be several bulkload operations in flight 
> for the same hfile, causing lots of unnecessary IO and tying up handler 
> threads. This can seriously impact performance. In my case, the cluster 
> became unusable and the regionservers had to be kill -9'ed.
> Possible solutions:
>  # Require that hfiles already be on the same filesystem as HBase in order 
> for bulkloading to succeed. The copy could be handled by 
> LoadIncrementalHFiles before the regionserver is called.
>  # Others? I'm not familiar with Hadoop IPC so there may be tricks to extend 
> the timeout or something else.
> I'm willing to write a patch but I'd appreciate recommendations on how to 
> proceed.


        
