Strangely enough, it didn't help. I suspect I am just overloading the machines - they only have 4G of RAM. Using a separate machine, a single thread pushes in 1000 inserts per second, but a MapReduce on the cluster is doing only 500 (8 map tasks running on 4 nodes).
Cheers,
Tim

On Wed, Jul 22, 2009 at 5:21 PM, tim robertson<[email protected]> wrote:
> Below is a sample row (\N are ignored in the Map) so I will try the
> default of 2meg which should buffer a bunch before flushing
>
> Thanks for your tips,
>
> Tim
>
> 199798861 293 8107 8436 MNHNL Recorder database
> LUXNATFUND404573t Pilophorus cinnamopterus (KIRSCHBAUM,1856)
> \N \N \N \N \N \N \N \N
> \N \N 49.61 6.13 \N \N \N \N
> \N \N \N \N \N \N \N L.
> Reichling Parc (Luxembourg) 1979 7 10 \N \N
> \N \N 2009-02-20 04:19:51 2009-02-20 08:40:21
> \N 199798861 293 8107 29773 1519409 11922838
> 1 21560621 9917520 \N \N \N \N \N
> \N \N \N \N 49.61 6.13 50226 61
> 186 1979 7 1979-07-10 0 0 0
> 2 \N \N \N \N
>
>
> On Wed, Jul 22, 2009 at 5:13 PM, Jean-Daniel Cryans<[email protected]> wrote:
>> It really depends on the size of each Put. If 1 put = 1MB, then a 2MB
>> buffer (the default) won't be useful. A 1GB buffer (what you wrote)
>> will likely OOME your client and, if not, your region servers will in
>> no time.
>>
>> So try with the default and then if it goes well you can try setting
>> it higher. Do you know the size of each row?
>>
>> J-D
>>
>> On Wed, Jul 22, 2009 at 11:04 AM, tim robertson<[email protected]> wrote:
>>> Could you suggest a sensible write buffer size please?
>>>
>>> 1024x1024x1024 bytes?
>>>
>>> Cheers
>>>
>>> On Wed, Jul 22, 2009 at 4:41 PM, tim robertson<[email protected]> wrote:
>>>> Thanks J-D
>>>>
>>>> I will try this now.
>>>>
>>>> On Wed, Jul 22, 2009 at 3:44 PM, Jean-Daniel Cryans<[email protected]> wrote:
>>>>> Tim,
>>>>>
>>>>> Are you using the write buffer? See HTable.setAutoFlush and
>>>>> HTable.setWriteBufferSize if not. This will help a lot.
>>>>>
>>>>> Also since you have only 4 machines, try setting the HDFS replication
>>>>> factor lower than 3.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Jul 22, 2009 at 8:26 AM, tim robertson<[email protected]> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I have a 70G sparsely populated tab file (74 columns) to load into 2
>>>>>> column families in a single HBase table.
>>>>>>
>>>>>> I am running on my tiny dev cluster (4 mac minis, 4G RAM, each running
>>>>>> all Hadoop daemons and RegionServers) just to familiarise myself, while
>>>>>> the proper rack is being set up.
>>>>>>
>>>>>> I wrote a MapReduce job where I load into HBase during the Map:
>>>>>>
>>>>>> String rowID = UUID.randomUUID().toString();
>>>>>> Put row = new Put(rowID.getBytes());
>>>>>> int fields = reader.readAllInto(splits, row); // uses a properties
>>>>>> file to map tab columns to column families
>>>>>> context.setStatus("Map updating cell for row[" + rowID + "] with " +
>>>>>> fields + " fields");
>>>>>> table.put(row);
>>>>>>
>>>>>> Is this the preferred way to do this kind of loading, or is a
>>>>>> TableOutputFormat likely to outperform the Map version?
>>>>>>
>>>>>> [Knowing performance estimates are pointless on this cluster - I see
>>>>>> 500 records per sec input, which is a bit disappointing. I have
>>>>>> default Hadoop and HBase config and had to put a ZK quorum on each to
>>>>>> get HBase to start]
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Tim
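
For reference, here is a minimal sketch of the client-side write buffering J-D suggests above, assuming the 0.20-era HTable API (setAutoFlush, setWriteBufferSize, flushCommits). The table name "occurrence", the family/qualifier names, and the record loop are placeholders standing in for Tim's actual reader, not part of the original thread.

import java.util.UUID;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: buffered loading from a map task or a standalone client.
HBaseConfiguration conf = new HBaseConfiguration();
HTable table = new HTable(conf, "occurrence");   // placeholder table name
table.setAutoFlush(false);                       // stop sending one RPC per Put
table.setWriteBufferSize(2 * 1024 * 1024);       // start with the 2MB default, then tune

for (String[] record : records) {                // placeholder for the tab-file reader
    String rowID = UUID.randomUUID().toString();
    Put put = new Put(Bytes.toBytes(rowID));
    // placeholder family/qualifier; in practice the properties-driven mapping fills these
    put.add(Bytes.toBytes("raw"), Bytes.toBytes("col1"), Bytes.toBytes(record[0]));
    table.put(put);                              // buffered client-side until the buffer fills
}

table.flushCommits();                            // push any Puts still sitting in the buffer
table.close();

With autoFlush off, Puts accumulate on the client and reach the region servers in batches, so the number of round trips drops sharply; the exact buffer size usually matters less than simply enabling the buffering, which is why starting from the 2MB default and measuring is the conservative path.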
