Yeah, I considered working on this -- but we can import our entire production DB in just a few hours, and then it's all incremental from there. So bulk insert isn't a huge use case for us.
On Tue, Jul 28, 2009 at 9:19 AM, Jonathan Gray <[email protected]> wrote:
> Though HBase imports are fairly fast, they would probably be 5-10x
> faster with a straight-to-hfile import method.
>
> Once we get 0.20.0 shipped, we should have more time to spend on
> actually implementing this. Though anyone is welcome to take a shot.
> Stack described it well.
>
> JG
>
> Ryan Rawson wrote:
>>
>> The last time I seriously looked at this, it was to address serious
>> performance issues with HBase. I eventually fixed said performance
>> issues, and so dropped the idea.
>>
>> -ryan
>>
>> On Mon, Jul 27, 2009 at 1:52 PM, stack <[email protected]> wrote:
>>>
>>> Latest thinking is to write an MR job whose reducer writes hfiles
>>> that are just under a region in size (<256M). When the reducer has
>>> written about 240MB, it opens a new file. (We may need to write a
>>> custom ReduceRunner to keep account of what's been written and to
>>> rotate the file.)
>>>
>>> After the MR has finished, a script would come along and move the
>>> hfiles into the appropriate directory structure. Each hfile would
>>> be the sole content of its region. The script would read each
>>> hfile's first and last keys from its metadata and then, using this
>>> metainfo along with a table schema specified externally, insert an
>>> entry into .META. per region (see the copy table and rename table
>>> scripts in bin for examples of how to manipulate .META.).
>>>
>>> Someone needs to just do it. We've been talking about it forever.
>>>
>>> St.Ack
>>> P.S. Here is older thinking on the topic:
>>> https://issues.apache.org/jira/browse/HBASE-48
>>>
>>> On Mon, Jul 27, 2009 at 1:31 PM, tim robertson
>>> <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Ryan wrote on a different thread:
>>>>
>>>> "It should be possible to randomly insert data from a pre-existing
>>>> data set. There is some work to directly import straight into
>>>> hfiles and skipping the regionserver, but that would only really
>>>> work on one-time imports to new tables."
>>>>
>>>> Could someone please elaborate on this a little and outline the
>>>> steps needed? Do you write an hfile in a custom mapreduce output
>>>> format and then somehow write the table metadata file afterwards?
>>>>
>>>> Cheers,
>>>>
>>>> Tim

--
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
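
For concreteness, here is a rough sketch of the reducer-side hfile
rotation Stack describes, written against the 0.20-era APIs as best I
know them. It assumes the job runs with a single reducer (or a
total-order partitioner) so rows arrive sorted; the class name, the
hfile.output.dir conf key, and the fixed family/qualifier are made up
for illustration, and the HFile.Writer constructor may differ from
what actually ships:

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.io.hfile.HFile;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.mapreduce.Reducer;

  public class HFileWritingReducer
      extends Reducer<ImmutableBytesWritable, BytesWritable,
                      ImmutableBytesWritable, BytesWritable> {

    // Roll just under the 256M region size Stack mentions.
    private static final long ROLL_AT = 240L * 1024 * 1024;
    private static final byte[] FAMILY = Bytes.toBytes("f");
    private static final byte[] QUALIFIER = Bytes.toBytes("q");

    private FileSystem fs;
    private Path outDir;
    private HFile.Writer writer;
    private long written = 0;
    private int fileCount = 0;

    @Override
    protected void setup(Context ctx) throws IOException {
      fs = FileSystem.get(ctx.getConfiguration());
      // "hfile.output.dir" is a made-up conf key for this sketch.
      outDir = new Path(ctx.getConfiguration().get("hfile.output.dir"));
    }

    @Override
    protected void reduce(ImmutableBytesWritable row,
        Iterable<BytesWritable> values, Context ctx)
        throws IOException, InterruptedException {
      // Rotate only at row boundaries so a single row never straddles
      // two hfiles -- a row has to live entirely inside one region.
      if (writer == null || written >= ROLL_AT) {
        rotate();
      }
      for (BytesWritable v : values) {
        byte[] value = new byte[v.getLength()];
        System.arraycopy(v.getBytes(), 0, value, 0, v.getLength());
        KeyValue kv = new KeyValue(row.get(), FAMILY, QUALIFIER,
            System.currentTimeMillis(), value);
        writer.append(kv);   // appends must arrive in sorted order
        written += kv.getLength();
      }
    }

    // Close the current hfile and open the next one; this is the
    // "keep account of what's been written and rotate" bookkeeping.
    private void rotate() throws IOException {
      if (writer != null) {
        writer.close();
      }
      Path p = new Path(outDir, String.format("hfile-%05d", fileCount++));
      writer = new HFile.Writer(fs, p);  // defaults; a real job would
                                         // set block size/compression
      written = 0;
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
      if (writer != null) {
        writer.close();
      }
    }
  }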

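And a similarly rough sketch of the follow-up script that stamps a
.META. row out of each hfile, here as Java rather than shell for
clarity. One subtlety Stack's description implies: region end keys are
exclusive and adjacent regions have to butt up against each other, so
the sketch sorts the hfiles by first row and uses each neighbour's
first row as the previous region's end key, with empty start/end keys
at the edges. The hard-coded table descriptor stands in for the "table
schema specified externally", and the API details (HFile.Reader
arguments, KeyValue.createKeyValueFromKey, the catalog family
constants) are from trunk as I remember them, so check them before
trusting this:

  import java.util.TreeMap;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HConstants;
  import org.apache.hadoop.hbase.HRegionInfo;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.io.hfile.HFile;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.hbase.util.Writables;

  public class MetaLoader {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      FileSystem fs = FileSystem.get(conf);

      // Stand-in for the externally specified table schema.
      HTableDescriptor desc = new HTableDescriptor("mytable");
      desc.addFamily(new HColumnDescriptor("f"));

      // Collect each hfile's first row, sorted, so neighbours can
      // supply each other's region boundaries.
      TreeMap<byte[], Path> firstRows =
          new TreeMap<byte[], Path>(Bytes.BYTES_COMPARATOR);
      for (FileStatus f : fs.listStatus(new Path(args[0]))) {
        HFile.Reader r = new HFile.Reader(fs, f.getPath(), null, false);
        r.loadFileInfo();
        // getFirstKey() returns the full KeyValue key; pull the row out.
        byte[] row =
            KeyValue.createKeyValueFromKey(r.getFirstKey()).getRow();
        firstRows.put(row, f.getPath());
        r.close();
      }

      // End keys are exclusive: each region ends where the next one
      // begins; the first start key and last end key are empty.
      HTable meta = new HTable(conf, HConstants.META_TABLE_NAME);
      byte[][] starts = firstRows.keySet().toArray(new byte[0][]);
      for (int i = 0; i < starts.length; i++) {
        byte[] start = (i == 0)
            ? HConstants.EMPTY_BYTE_ARRAY : starts[i];
        byte[] end = (i == starts.length - 1)
            ? HConstants.EMPTY_BYTE_ARRAY : starts[i + 1];
        HRegionInfo info = new HRegionInfo(desc, start, end);
        Put p = new Put(info.getRegionName());
        p.add(HConstants.CATALOG_FAMILY,
            HConstants.REGIONINFO_QUALIFIER, Writables.getBytes(info));
        meta.put(p);
      }
    }
  }

A real script would also move each hfile into its region's family
directory under the table dir and get the regions assigned; the copy
table and rename table scripts in bin show the .META. mechanics.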