It really depends on the size of each Put. If 1 Put = 1MB, then a 2MB
buffer (the default) won't buy you much. A 1GB buffer (what you wrote)
will likely OOME your client, and if it doesn't, it will OOME your
region servers in no time.

So start with the default, and if that goes well you can try setting
it higher. Do you know the size of each row?
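
For reference, here's roughly what the client side looks like with the
buffer turned on - a minimal sketch against the 0.20-era HTable API
(the table name and the 12MB figure are placeholders, not
recommendations):

  HTable table = new HTable(new HBaseConfiguration(), "mytable");
  table.setAutoFlush(false);                  // buffer Puts client-side instead of one RPC per Put
  table.setWriteBufferSize(12 * 1024 * 1024); // flush to the region servers every ~12MB
  // ... do the puts ...
  table.flushCommits();                       // push whatever is still in the buffer at the end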

J-D

On Wed, Jul 22, 2009 at 11:04 AM, tim robertson <[email protected]> wrote:
> Could you suggest a sensible write buffer size please?
>
> 1024x1024x1024 bytes?
>
> Cheers
>
>
>
>
>
> On Wed, Jul 22, 2009 at 4:41 PM, tim robertson <[email protected]> wrote:
>> Thanks J-D
>>
>> I will try this now.
>>
>> On Wed, Jul 22, 2009 at 3:44 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>> Tim,
>>>
>>> Are you using the write buffer? See HTable.setAutoFlush and
>>> HTable.setWriteBufferSize if not. This will help a lot.
>>>
>>> Also since you have only 4 machines, try setting the HDFS replication
>>> factor lower than 3.
>>>
>>> J-D
>>>
>>> On Wed, Jul 22, 2009 at 8:26 AM, tim robertson <[email protected]> wrote:
>>>> Hi all,
>>>>
>>>> I have a 70G sparsely populated tab-delimited file (74 columns) to
>>>> load into 2 column families in a single HBase table.
>>>>
>>>> I am running on my tiny dev cluster (4 Mac minis, 4G RAM, each running
>>>> all the Hadoop daemons and a RegionServer) just to familiarise myself
>>>> while the proper rack is being set up.
>>>>
>>>> I wrote a MapReduce job where I load into HBase during the Map:
>>>>  String rowID = UUID.randomUUID().toString();
>>>>  Put row = new Put(rowID.getBytes());
>>>>  // uses a properties file to map tab columns to column families
>>>>  int fields = reader.readAllInto(splits, row);
>>>>  context.setStatus("Map updating cell for row[" + rowID + "] with "
>>>>      + fields + " fields");
>>>>  table.put(row);
>>>>
>>>> Is this the preferred way to do this kind of loading or is a
>>>> TableOutputFormat likely to outperform the Map version?
>>>>
>>>> [I know performance estimates are pointless on this cluster - I see
>>>> 500 records per sec input, which is a bit disappointing. I have the
>>>> default Hadoop and HBase config, and had to put a ZK quorum peer on
>>>> each node to get HBase to start.]
>>>>
>>>> Cheers,
>>>>
>>>> Tim
>>>>
>>>
>>
>
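
To the TableOutputFormat question above: as far as I know it just wraps
an HTable and calls put() for you, so I wouldn't expect it to outperform
puts done directly in the map. For completeness, a minimal map-only
sketch against the 0.20 mapreduce API ("mytable" and LoaderMapper are
placeholders):

  Job job = new Job(conf, "hbase-load");
  job.setMapperClass(LoaderMapper.class);  // emits (ImmutableBytesWritable, Put) pairs
  job.setOutputKeyClass(ImmutableBytesWritable.class);
  job.setOutputValueClass(Put.class);
  job.setOutputFormatClass(TableOutputFormat.class);
  job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "mytable");
  job.setNumReduceTasks(0);  // map-only: Puts go straight to the table

Either way, the write buffer is what will matter most for throughput.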
