[ https://issues.apache.org/jira/browse/HBASE-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429262#comment-13429262 ]

Dave Revell commented on HBASE-6358:
------------------------------------

{quote}If the size and speed don't matter, then wouldn't you have just used a 
normal (non-bulk-load) MR job to load the data?{quote}

There are other reasons to atomically load hfiles even for non-huge datasets, 
such as ETL and restoring backups, and atomicity itself could benefit certain 
use cases. But it's probably not asking too much for people with these use 
cases to use a distributed hfile loader that depends on MapReduce, so I'm 
willing to concede the point.

@Todd, would you be in favor of adding another JIRA ticket for a distributed 
bulk loader, and having this ticket be blocked until it's done? I think it 
should be blocked so we don't remove the current "bulkload from remote fs" 
capability without offering an alternative, though the user does have the 
option of running distcp themselves.
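
For reference, deciding whether a copy (via distcp or otherwise) is even needed comes down to comparing the source hfile's filesystem URI with HBase's root filesystem URI. Here's a minimal sketch of that check using plain java.net.URI; the class and method names are hypothetical for illustration, not the actual LoadIncrementalHFiles code:

```java
import java.net.URI;

public class BulkLoadFsCheck {
    // Hypothetical helper: returns true if the hfile lives on a different
    // filesystem than HBase and would need to be copied before bulk loading.
    // Two URIs point at the same filesystem when their scheme and authority
    // (namenode host:port) match; paths don't matter for this decision.
    public static boolean needsCopy(URI srcFile, URI hbaseRoot) {
        String srcScheme = srcFile.getScheme();
        String dstScheme = hbaseRoot.getScheme();
        // Missing scheme means "default filesystem"; be conservative and copy
        // unless both sides are relative to the same default.
        if (srcScheme == null || dstScheme == null) {
            return srcScheme != dstScheme;
        }
        if (!srcScheme.equalsIgnoreCase(dstScheme)) {
            return true;
        }
        String srcAuth = srcFile.getAuthority();
        String dstAuth = hbaseRoot.getAuthority();
        if (srcAuth == null) {
            return dstAuth != null;
        }
        return !srcAuth.equalsIgnoreCase(dstAuth);
    }
}
```

If the check says a copy is needed, the user could run something like `hadoop distcp hdfs://src-nn:8020/hfiles hdfs://hbase-nn:8020/staging` before loading (paths here are illustrative).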
                
> Bulkloading from remote filesystem is problematic
> -------------------------------------------------
>
>                 Key: HBASE-6358
>                 URL: https://issues.apache.org/jira/browse/HBASE-6358
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.94.0
>            Reporter: Dave Revell
>            Assignee: Dave Revell
>         Attachments: 6358-suggestion.txt, HBASE-6358-trunk-v1.diff, 
> HBASE-6358-trunk-v2.diff, HBASE-6358-trunk-v3.diff
>
>
> Bulk loading hfiles that don't live on the same filesystem as HBase can cause 
> problems for subtle reasons.
> In Store.bulkLoadHFile(), the regionserver will copy the source hfile to its 
> own filesystem if it's not already there. Since this can take a long time for 
> large hfiles, it's likely that the client will time out and retry. When the 
> client retries repeatedly, there may be several bulkload operations in flight 
> for the same hfile, causing lots of unnecessary IO and tying up handler 
> threads. This can seriously impact performance. In my case, the cluster 
> became unusable and the regionservers had to be kill -9'ed.
> Possible solutions:
>  # Require that hfiles already be on the same filesystem as HBase in order 
> for bulkloading to succeed. The copy could be handled by 
> LoadIncrementalHFiles before the regionserver is called.
>  # Others? I'm not familiar with Hadoop IPC so there may be tricks to extend 
> the timeout or something else.
> I'm willing to write a patch but I'd appreciate recommendations on how to 
> proceed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
