Strangely enough, it didn't help. I suspect I am just overloading the machines - they only have 4G of RAM. Using a separate machine, a single thread pushes in 1000 inserts per second, but a MapReduce on the cluster is doing only 500 (8 map tasks running on 4 nodes).
Cheers,
Tim

On Wed, Jul 22, 2009 at 5:21 PM, tim robertson<[email protected]> wrote:
> Below is a sample row (\N are ignored in the Map) so I will try the
> default of 2meg which should buffer a bunch before flushing
>
> Thanks for your tips,
>
> Tim
>
> 199798861 293 8107 8436 MNHNL Recorder database
> LUXNATFUND404573t Pilophorus cinnamopterus (KIRSCHBAUM,1856)
> \N \N \N \N \N \N \N \N
> \N \N 49.61 6.13 \N \N \N \N
> \N \N \N \N \N \N \N L.
> Reichling Parc (Luxembourg) 1979 7 10 \N \N
> \N \N 2009-02-20 04:19:51 2009-02-20 08:40:21
> \N 199798861 293 8107 29773 1519409 11922838
> 1 21560621 9917520 \N \N \N \N \N
> \N \N \N \N 49.61 6.13 50226 61
> 186 1979 7 1979-07-10 0 0 0
> 2 \N \N \N \N
>
>
> On Wed, Jul 22, 2009 at 5:13 PM, Jean-Daniel Cryans<[email protected]> wrote:
>> It really depends on the size of each Put. If 1 put = 1MB, then a 2MB
>> buffer (the default) won't be useful. A 1GB buffer (what you wrote)
>> will likely OOME your client and, if not, your region servers will in
>> no time.
>>
>> So try with the default and then if it goes well you can try setting
>> it higher. Do you know the size of each row?
>>
>> J-D
>>
>> On Wed, Jul 22, 2009 at 11:04 AM, tim robertson<[email protected]> wrote:
>>> Could you suggest a sensible write buffer size please?
>>>
>>> 1024x1024x1024 bytes?
>>>
>>> Cheers
>>>
>>> On Wed, Jul 22, 2009 at 4:41 PM, tim robertson<[email protected]> wrote:
>>>> Thanks J-D
>>>>
>>>> I will try this now.
>>>>
>>>> On Wed, Jul 22, 2009 at 3:44 PM, Jean-Daniel Cryans<[email protected]> wrote:
>>>>> Tim,
>>>>>
>>>>> Are you using the write buffer? See HTable.setAutoFlush and
>>>>> HTable.setWriteBufferSize if not. This will help a lot.
>>>>>
>>>>> Also since you have only 4 machines, try setting the HDFS replication
>>>>> factor lower than 3.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Jul 22, 2009 at 8:26 AM, tim robertson<[email protected]> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I have a 70G sparsely populated tab file (74 columns) to load into 2
>>>>>> column families in a single HBase table.
>>>>>>
>>>>>> I am running on my tiny dev cluster (4 mac minis, 4G RAM, each running
>>>>>> all Hadoop daemons and RegionServers) just to familiarise myself, while
>>>>>> the proper rack is being set up.
>>>>>>
>>>>>> I wrote a MapReduce job where I load into HBase during the Map:
>>>>>>
>>>>>> String rowID = UUID.randomUUID().toString();
>>>>>> Put row = new Put(rowID.getBytes());
>>>>>> int fields = reader.readAllInto(splits, row); // uses a properties
>>>>>> file to map tab columns to column families
>>>>>> context.setStatus("Map updating cell for row[" + rowID + "] with " +
>>>>>> fields + " fields");
>>>>>> table.put(row);
>>>>>>
>>>>>> Is this the preferred way to do this kind of loading, or is a
>>>>>> TableOutputFormat likely to outperform the Map version?
>>>>>>
>>>>>> [Knowing performance estimates are pointless on this cluster - I see
>>>>>> 500 records per sec input, which is a bit disappointing. I have
>>>>>> default Hadoop and HBase config and had to put a ZK quorum on each to
>>>>>> get HBase to start]
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Tim
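
For reference, here is a minimal sketch of the client-side write buffering J-D suggests above, assuming the 0.20-era HTable API (setAutoFlush, setWriteBufferSize, flushCommits). The table name "occurrence", the family/qualifier names, and the record loop are placeholders standing in for Tim's actual reader, not part of the original thread.

import java.util.UUID;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: buffered loading from a map task or a standalone client.
HBaseConfiguration conf = new HBaseConfiguration();
HTable table = new HTable(conf, "occurrence");   // placeholder table name
table.setAutoFlush(false);                       // stop sending one RPC per Put
table.setWriteBufferSize(2 * 1024 * 1024);       // start with the 2MB default, then tune

for (String[] record : records) {                // placeholder for the tab-file reader
    String rowID = UUID.randomUUID().toString();
    Put put = new Put(Bytes.toBytes(rowID));
    // placeholder family/qualifier; in practice the properties-driven mapping fills these
    put.add(Bytes.toBytes("raw"), Bytes.toBytes("col1"), Bytes.toBytes(record[0]));
    table.put(put);                              // buffered client-side until the buffer fills
}

table.flushCommits();                            // push any Puts still sitting in the buffer
table.close();

With autoFlush off, Puts accumulate on the client and reach the region servers in batches, so the number of round trips drops sharply; the exact buffer size usually matters less than simply enabling the buffering, which is why starting from the 2MB default and measuring is the conservative path.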
