AFAIK Mac minis have just 2 cores, right? So 2 map tasks per machine + DataNode + RegionServer + ZK = 5 processes. From what I've seen, the region server will eat at least 1 CPU while under import, so that does not leave a lot of room for the rest. You could try with 1 map slot per machine and give HBase a heap of 2GB.
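For reference, those two knobs would be set roughly like this on a 0.20-era Hadoop/HBase install. This is a sketch only — file locations and property names assume the default distributions, and `HBASE_HEAPSIZE` is expressed in MB:

```
# conf/hbase-env.sh — give each region server JVM a 2 GB heap (value in MB)
export HBASE_HEAPSIZE=2000
```

```xml
<!-- conf/mapred-site.xml — allow only 1 concurrent map task per TaskTracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
```

Both require restarting the affected daemons (the region servers and the TaskTrackers respectively) to take effect.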
J-D

On Wed, Jul 22, 2009 at 12:23 PM, tim robertson<[email protected]> wrote:
> Strangely enough, it didn't help. I suspect I am just overloading the
> machines - they only have 4G of RAM.
> When I use a separate machine, a single thread pushes in 1000 inserts
> per second, but a MapReduce job on the cluster does only 500 (8 map
> tasks running on 4 nodes).
>
> Cheers,
>
> Tim
>
> On Wed, Jul 22, 2009 at 5:21 PM, tim robertson<[email protected]> wrote:
>> Below is a sample row (the \N fields are ignored in the Map), so I will
>> try the default of 2 MB, which should buffer a bunch of rows before
>> flushing.
>>
>> Thanks for your tips,
>>
>> Tim
>>
>> 199798861 293 8107 8436 MNHNL Recorder database
>> LUXNATFUND404573t Pilophorus cinnamopterus (KIRSCHBAUM,1856)
>> \N \N \N \N \N \N \N \N \N \N 49.61 6.13 \N \N \N \N
>> \N \N \N \N \N \N \N L. Reichling Parc (Luxembourg)
>> 1979 7 10 \N \N \N \N 2009-02-20 04:19:51 2009-02-20 08:40:21
>> \N 199798861 293 8107 29773 1519409 11922838 1 21560621 9917520
>> \N \N \N \N \N \N \N \N \N 49.61 6.13 50226 61
>> 186 1979 7 1979-07-10 0 0 0 2 \N \N \N \N
>>
>> On Wed, Jul 22, 2009 at 5:13 PM, Jean-Daniel Cryans<[email protected]> wrote:
>>> It really depends on the size of each Put. If 1 put = 1MB, then a 2MB
>>> buffer (the default) won't be useful. A 1GB buffer (what you wrote)
>>> will likely OOME your client and, if not, your region servers in no
>>> time.
>>>
>>> So try with the default, and then if it goes well you can try setting
>>> it higher. Do you know the size of each row?
>>>
>>> J-D
>>>
>>> On Wed, Jul 22, 2009 at 11:04 AM, tim robertson<[email protected]> wrote:
>>>> Could you suggest a sensible write buffer size please?
>>>>
>>>> 1024x1024x1024 bytes?
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Jul 22, 2009 at 4:41 PM, tim robertson<[email protected]> wrote:
>>>>> Thanks J-D
>>>>>
>>>>> I will try this now.
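To make the buffer-size numbers concrete: assuming rows of roughly 1 KB (an assumption, in the ballpark of the sample row above), the default 2 MB write buffer already batches a couple of thousand Puts per flush, while 1024x1024x1024 bytes is a full GiB. A minimal sketch of the arithmetic:

```java
public class BufferMath {
    // how many rows of a given size fit in a client-side write buffer
    static long rowsBuffered(long bufferBytes, long rowBytes) {
        return bufferBytes / rowBytes;
    }

    public static void main(String[] args) {
        long kb = 1024, mb = kb * kb, gb = mb * kb;  // gb == 1024x1024x1024 bytes
        long rowBytes = kb;                          // assumed ~1 KB per sparse row

        System.out.println("2 MB default buffer -> ~"
                + rowsBuffered(2 * mb, rowBytes) + " puts per flush");  // ~2048
        System.out.println("1 GiB buffer        -> ~"
                + rowsBuffered(gb, rowBytes) + " puts per flush");      // ~1048576
    }
}
```

At ~1 KB per row, a 1 GiB buffer would hold on the order of a million Puts in client memory before a single flush, which is why it risks an OOME on the client and a flood on the region servers, as J-D warns.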
>>>>>
>>>>> On Wed, Jul 22, 2009 at 3:44 PM, Jean-Daniel Cryans<[email protected]> wrote:
>>>>>> Tim,
>>>>>>
>>>>>> Are you using the write buffer? See HTable.setAutoFlush and
>>>>>> HTable.setWriteBufferSize if not. This will help a lot.
>>>>>>
>>>>>> Also, since you have only 4 machines, try setting the HDFS replication
>>>>>> factor lower than 3.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Wed, Jul 22, 2009 at 8:26 AM, tim robertson<[email protected]> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a 70G sparsely populated tab file (74 columns) to load into 2
>>>>>>> column families in a single HBase table.
>>>>>>>
>>>>>>> I am running on my tiny dev cluster (4 Mac minis, 4G of RAM, each
>>>>>>> running all the Hadoop daemons and a RegionServer) just to
>>>>>>> familiarise myself, while the proper rack is being set up.
>>>>>>>
>>>>>>> I wrote a MapReduce job where I load into HBase during the Map:
>>>>>>>
>>>>>>>   String rowID = UUID.randomUUID().toString();
>>>>>>>   Put row = new Put(rowID.getBytes());
>>>>>>>   // uses a properties file to map tab columns to column families
>>>>>>>   int fields = reader.readAllInto(splits, row);
>>>>>>>   context.setStatus("Map updating cell for row[" + rowID + "] with "
>>>>>>>       + fields + " fields");
>>>>>>>   table.put(row);
>>>>>>>
>>>>>>> Is this the preferred way to do this kind of loading, or is a
>>>>>>> TableOutputFormat likely to outperform the Map version?
>>>>>>>
>>>>>>> [Knowing performance estimates are pointless on this cluster - I see
>>>>>>> 500 records per second input, which is a bit disappointing. I have
>>>>>>> the default Hadoop and HBase config, and had to put a ZK quorum
>>>>>>> member on each node to get HBase to start.]
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Tim
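Folding J-D's write-buffer advice into the loading code above would look roughly like this. This is a sketch against the 0.20-era HTable client API, not a definitive implementation: the table name "occurrence" is a placeholder, the cell-population step is elided, and it needs a running cluster, so it is not runnable standalone:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import java.util.UUID;

HTable table = new HTable(new HBaseConfiguration(), "occurrence"); // table name is a placeholder
table.setAutoFlush(false);                  // buffer puts client-side instead of one RPC per put
table.setWriteBufferSize(2 * 1024 * 1024);  // start at the 2 MB default, then tune upward

// inside map():
String rowID = UUID.randomUUID().toString();
Put row = new Put(rowID.getBytes());
// ... add cells parsed from the tab-delimited line ...
table.put(row);        // queued in the write buffer, flushed when it fills

// when the task finishes:
table.flushCommits();  // push any puts still sitting in the buffer
```

The final flushCommits() matters: without it, whatever is left in the buffer when the map task ends is silently dropped.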
