Thanks J-D,

2 cores - correct.  While this is kinda futile given the hardware I am
running on, I am learning a fair amount that should translate to tuning
the real cluster.

I have moved ZK to a single daemon on my master, reduced the Maps to 1
per node, and dropped HDFS replication to 2, so I now have on each
slave:
- 1 Map, datanode, region server (with a 2000M heap)
(and a separate master running the NameNode, ZK etc.)
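
For reference, this is roughly what that means in config terms (property
names are from the stock Hadoop 0.20-era configs, and the heap setting
lives in hbase-env.sh - values are just the ones above):

  <!-- mapred-site.xml: 1 map slot per TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

  <!-- hdfs-site.xml: replication of 2 instead of the default 3 -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>

  # hbase-env.sh: give the region servers a 2000MB heap
  export HBASE_HEAPSIZE=2000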

With this configuration, a standalone client (running on the master in
Eclipse) iterating over the tab file gave 1000 inserts per second (a 2x
improvement over the previous config of ZK on each machine), and the
MapReduce load increased from 500/sec to 700/sec.
I'm surprised to see the Maps take a long, long time to finish - I
wonder if there is some blocking going on or something like that...

Below are the Map functions - I could be doing something stupid of course ;o)

Thanks for any insights/ideas anyone can offer,

Tim


@Override
protected void setup(Context context) throws IOException,
        InterruptedException {
    super.setup(context);
    hbConf = new HBaseConfiguration();
    table = new HTable(hbConf, context.getConfiguration().get("table.name"));
    // buffer Puts client-side rather than a round trip per Put
    table.setAutoFlush(false);
    table.setWriteBufferSize(1024 * 1024 * 2); // 2MB
    // this is a utility that uses a properties file to map columns in
    // fielded text; ignores \N and uses the tab file format
    reader = new ConfigurableRecordReader(
            context.getConfiguration().get("input.mapping"), true, "\t");
}

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (table == null) {
        throw new IOException(
                "Table cannot be null. This Mapper is not configured correctly.");
    }

    String[] splits = reader.split(value.toString());

    // consider a business unique, or a UUID generated from a business unique?
    String rowID = UUID.randomUUID().toString();

    Put row = new Put(rowID.getBytes());
    int fields = reader.readAllInto(splits, row);
    context.setStatus("Map updating cell for row[" + rowID + "] with "
            + fields + " fields");
    table.put(row);
}

On Wed, Jul 22, 2009 at 6:28 PM, Jean-Daniel Cryans<[email protected]> wrote:
> afaik mac minis have just 2 cores right? So 2 map tasks per machine +
> datanode + region server + ZK = 5 processes. From what I've seen the
> region server will eat at least 1 CPU while under import so that does
> not leave a lot of room for the rest. You could try with 1 map slot
> per machine and give HBase a heap of 2GB.
>
> J-D
>
> On Wed, Jul 22, 2009 at 12:23 PM, tim
> robertson<[email protected]> wrote:
>> Strangely enough, it didn't help.  I suspect I am just overloading the
>> machines - they only have 4G ram.
>> When I use a separate machine, a single thread pushes in 1000
>> inserts per second, but a MapReduce job on the cluster does only 500
>> (8 map tasks running on 4 nodes).
>>
>>
>> Cheers,
>>
>> Tim
>>
>>
>> On Wed, Jul 22, 2009 at 5:21 PM, tim robertson<[email protected]> 
>> wrote:
>>> Below is a sample row (\N are ignored in the Map) so I will try the
>>> default of 2meg which should buffer a bunch before flushing
>>>
>>> Thanks for your tips,
>>>
>>> Tim
>>>
>>> 199798861  293  8107  8436  MNHNL  Recorder database  LUXNATFUND404573t  Pilophorus cinnamopterus (KIRSCHBAUM,1856)  \N  \N  \N  \N  \N  \N  \N  \N  \N  \N  49.61  6.13  \N  \N  \N  \N  \N  \N  \N  \N  \N  \N  \N  L. Reichling  Parc (Luxembourg)  1979  7  10  \N  \N  \N  \N  2009-02-20 04:19:51  2009-02-20 08:40:21  \N  199798861  293  8107  29773  1519409  11922838  1  21560621  9917520  \N  \N  \N  \N  \N  \N  \N  \N  \N  49.61  6.13  50226  61  186  1979  7  1979-07-10  0  0  0  2  \N  \N  \N  \N
>>>
>>>
>>> On Wed, Jul 22, 2009 at 5:13 PM, Jean-Daniel Cryans<[email protected]> 
>>> wrote:
>>>> It really depends on the size of each Put. If 1 put = 1MB, then a 2MB
>>>> buffer (the default) won't be useful. A 1GB buffer (what you wrote)
>>>> will likely OOME your client and, if not, your region servers will in
>>>> no time.
>>>>
>>>> So try with the default and then if it goes well you can try setting
>>>> it higher. Do you know the size of each row?
>>>>
>>>> J-D
>>>>
>>>> On Wed, Jul 22, 2009 at 11:04 AM, tim
>>>> robertson<[email protected]> wrote:
>>>>> Could you suggest a sensible write buffer size please?
>>>>>
>>>>> 1024x1024x1024 bytes?
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 22, 2009 at 4:41 PM, tim robertson<[email protected]> 
>>>>> wrote:
>>>>>> Thanks J-D
>>>>>>
>>>>>> I will try this now.
>>>>>>
>>>>>> On Wed, Jul 22, 2009 at 3:44 PM, Jean-Daniel Cryans<[email protected]> 
>>>>>> wrote:
>>>>>>> Tim,
>>>>>>>
>>>>>>> Are you using the write buffer? See HTable.setAutoFlush and
>>>>>>> HTable.setWriteBufferSize if not. This will help a lot.
>>>>>>>
>>>>>>> Also since you have only 4 machines, try setting the HDFS replication
>>>>>>> factor lower than 3.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Wed, Jul 22, 2009 at 8:26 AM, tim 
>>>>>>> robertson<[email protected]> wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have a 70G sparsely populated tab file (74 columns) to load into 2
>>>>>>>> column families in a single HBase table.
>>>>>>>>
>>>>>>>> I am running on my tiny dev cluster (4 mac minis, 4G ram, each running
>>>>>>>> all Hadoop daemons and RegionServers) to just familiarise myself, while
>>>>>>>> the proper rack is being set up.
>>>>>>>>
>>>>>>>> I wrote a MapReduce job where I load into HBase during the Map:
>>>>>>>>   String rowID = UUID.randomUUID().toString();
>>>>>>>>   Put row = new Put(rowID.getBytes());
>>>>>>>>   // uses a properties file to map tab columns to column families
>>>>>>>>   int fields = reader.readAllInto(splits, row);
>>>>>>>>   context.setStatus("Map updating cell for row[" + rowID + "] with " + fields + " fields");
>>>>>>>>   table.put(row);
>>>>>>>>
>>>>>>>> Is this the preferred way to do this kind of loading or is a
>>>>>>>> TableOutputFormat likely to outperform the Map version?
>>>>>>>>
>>>>>>>> [I know performance estimates are pointless on this cluster - I see
>>>>>>>> 500 records per sec input, which is a bit disappointing.  I have
>>>>>>>> default Hadoop and HBase config and had to put a ZK quorum member on
>>>>>>>> each node to get HBase to start]
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Tim
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
