[ https://issues.apache.org/jira/browse/HBASE-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566397#action_12566397 ]

Bryan Duxbury commented on HBASE-48:
------------------------------------

In theory, writing directly to HDFS would be the fastest way to import data. 
However, the tricky part in my mind is that you need all the partitions not 
just to be sorted internally but sorted relative to each other. This means that 
the partitioning function you use has to preserve the lexical ordering of keys 
across partitions as well. Without knowing what the data looks like ahead of 
time, how can you know how to efficiently partition the data into regions?
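One way to get such boundaries without knowing the data ahead of time is to sample 
the input row keys first, sort the sample, and take evenly spaced keys as partition 
boundaries. The sketch below is only an illustration of that idea; the class and 
method names are made up and nothing in it is HBase-specific. Hadoop's 
InputSampler/TotalOrderPartitioner pair does essentially the same thing for 
MapReduce jobs.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustration only: derive lexically ordered partition boundaries by sampling
// row keys, so that partition i holds only keys below boundary i (and the last
// partition holds the tail). All names here are hypothetical.
public class SplitKeySampler {

  // Pick (numPartitions - 1) boundary keys from a uniform sample of the input keys.
  // Assumes the input yields at least sampleSize keys.
  public static List<String> sampleBoundaries(Iterable<String> rowKeys,
                                              int numPartitions,
                                              int sampleSize) {
    List<String> sample = new ArrayList<>();
    Random rng = new Random();
    int seen = 0;
    // Reservoir sampling: keeps a uniform sample without knowing the input size.
    for (String key : rowKeys) {
      seen++;
      if (sample.size() < sampleSize) {
        sample.add(key);
      } else {
        int j = rng.nextInt(seen);
        if (j < sampleSize) {
          sample.set(j, key);
        }
      }
    }
    Collections.sort(sample);  // lexical (String natural) order

    List<String> boundaries = new ArrayList<>();
    for (int i = 1; i < numPartitions; i++) {
      boundaries.add(sample.get(i * sample.size() / numPartitions));
    }
    return boundaries;
  }

  // Route a key to the partition whose key range contains it.
  public static int partitionFor(String key, List<String> boundaries) {
    int idx = Collections.binarySearch(boundaries, key);
    return idx >= 0 ? idx + 1 : -(idx + 1);
  }
}
{code}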

This doesn't account for trying to import a lot of data into a new table. In 
that case, it'd be quite futile to write tons of data into the existing regions' 
ranges, because the existing regions would just become enormous, and then all 
you're really doing is putting off the speed hit until the split/compact stage.
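One mitigation for the new-table case is to pre-split the table at sampled key 
boundaries before the import starts, so writes fan out across region servers 
immediately instead of piling into one region and deferring the cost to 
split/compact. The sketch below assumes the createTable-with-split-keys call 
found in later HBase client APIs (it did not exist in this form when this 
comment was written); the table name, family name, and split keys are 
placeholders.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: create the target table already split at sampled row-key boundaries,
// so a bulk import writes into many regions in parallel rather than one.
public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("bulk_target"));
    desc.addFamily(new HColumnDescriptor("d"));

    // Hypothetical boundaries; in practice, derive them by sampling the input keys.
    byte[][] splitKeys = new byte[][] {
        Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
    };
    admin.createTable(desc, splitKeys);
    admin.close();
  }
}
{code}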

What is it that actually holds back the speed of imports? The API mechanics and 
nothing else? The number of region servers participating in the import? The 
speed of the underlying disk? Do we even have a sense of what would be a good 
speed for bulk imports in the first place? I think this issue needs better 
definition before we can say what we should do.

> [hbase] Bulk load and dump tools
> --------------------------------
>
>                 Key: HBASE-48
>                 URL: https://issues.apache.org/jira/browse/HBASE-48
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>            Priority: Minor
>
> HBase needs tools to facilitate bulk upload and possibly dumping.  Going via 
> the current APIs, particularly if the dataset is large and cell content is 
> small, uploads can take a long time even when using many concurrent clients.
> PNUTS folks talked of the need for a different API to manage bulk upload/dump.
> Another notion would be to have the bulk loader tools somehow write regions 
> directly in HDFS.
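
For context, here is a sketch of what "write regions directly in HDFS" can look 
like with the MapReduce-era bulk-load machinery that later HBase releases shipped 
(HFileOutputFormat2 plus LoadIncrementalHFiles). None of this existed when the 
issue was filed, and the table name, column family, and input format below are 
placeholders.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of bulk import that bypasses the normal write path: a MapReduce job
// writes HFiles sorted and partitioned to match the table's current regions,
// then LoadIncrementalHFiles hands those files to the region servers.
public class BulkImportDriver {

  // Minimal mapper: input lines look like "rowkey<TAB>value"; emits one Put per
  // line into a hypothetical family "d", qualifier "v".
  public static class LineToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("bulk_target");
    Path input = new Path(args[0]);    // raw TSV input
    Path hfileDir = new Path(args[1]); // where the HFiles land before loading

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName);
         Admin admin = conn.getAdmin()) {

      Job job = Job.getInstance(conf, "hbase-bulk-import");
      job.setJarByClass(BulkImportDriver.class);
      job.setMapperClass(LineToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, hfileDir);

      // Sets the total-order partitioner and sorting reducer so the output
      // HFiles line up with the table's region boundaries.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }

      // Move the generated HFiles under the regions.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
    }
  }
}
{code}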

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
