My (very rough) calculation of the data size came out at around 50MB: 400 bytes * 100,000 for the values, 32 + 8 * 13 * 100,000 for the keys, and an extra meg or two of key overhead. I don't see how that would have resulted in a region split, so I assume we are still missing some information (or I made a mistake). As you mention, it should mean that everything is still in the MemStore and compression has not come into play yet. Puzzling...
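For what it's worth, here is that back-of-envelope arithmetic spelled out (nothing measured, just the same rough numbers as above; the 8 bytes per cell of key overhead is my own guess):

public class SizeEstimate {
  public static void main(String[] args) {
    long values = 400L * 100000;          // ~40 MB of cell values
    long keys = 32 + 8L * 13 * 100000;    // ~10 MB of per-cell key overhead (assumed)
    long extra = 2L * 1024 * 1024;        // "an extra meg or two" of key stuff
    System.out.println((values + keys + extra) / (1024 * 1024) + " MB"); // prints 50 MB
  }
}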
On PE: there isn't currently a way to specify compression options on the testtable without extending PE and overriding the org.apache.hadoop.hbase.PerformanceEvaluation#getTableDescriptor method (I've put a rough, untested sketch of what I mean at the bottom of this mail). Maybe it could be added as an option?

Cheers,
Dan

On 1 March 2010 10:56, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> As Dan said, your data is so small you don't really trigger many
> different behaviors in HBase; it could very well be kept mostly in the
> memstores, where compression has no impact at all.
>
> WRT a benchmark, there's the PerformanceEvaluation (we call it PE for
> short), which is well maintained and lets you set a compression level.
> This page has an outdated help but it shows you how to run it:
> http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
>
> Another option is importing the wikipedia dump, which is highly
> compressible and not manufactured like the PE. Last summer I wrote a
> small MR job to do the import easily, and although the code is based on
> a dev version of 0.20.0, it should be fairly easy to make it work on
> 0.20.3 (probably just replacing the libs). See
> http://code.google.com/p/hbase-wikipedia-loader/
>
> See the last paragraph of the Getting Started section in the Wiki, where
> I show some import numbers:
>
> "For example, it took 29 min on a 6-node cluster (1 master and 5
> region servers) with the same hardware (AMD Phenom(tm) 9550 Quad, 8GB,
> 2x1TB disks), 2 map slots per task tracker (that's 10 parallel maps),
> and GZ compression. With LZO and a new table it took 23 min 20 sec.
> Compressed, the table is 32 regions big; uncompressed it's 93 and took
> 30 min 10 sec to import."
>
> You can see that the import was a lot faster on LZO. I didn't do any
> reading test tho...
>
> Good luck!
>
> J-D
>
> On Sun, Feb 28, 2010 at 9:30 AM, Vincent Barat <vincent.ba...@ubikod.com> wrote:
> > The impact of my cluster architecture on the performance is obviously the
> > same in my 3 test cases. Provided that I only change the compression type
> > between tests, I don't understand why changing the number of regions or
> > whatever else would change the speed ratio between my tests, especially
> > between the GZIP & LZO tests.
> >
> > Is there some ready-to-use and easy-to-set-up benchmark I could use to try
> > to reproduce the issue in a well-known environment?
> >
> > On 25/02/10 19:29, Jean-Daniel Cryans wrote:
> >>
> >> If only 1 region, providing more than one node will probably just
> >> slow down the test, since the load is handled by one machine which has
> >> to replicate blocks 2 times. I think your test would have much more
> >> value if you really grew to at least 10 regions. Also make sure to run
> >> the tests more than once on completely new HBase setups (drop table +
> >> restart should be enough).
> >>
> >> May I also recommend upgrading to HBase 0.20.3? It will provide a
> >> better experience in general.
> >>
> >> J-D
> >>
> >> On Thu, Feb 25, 2010 at 2:49 AM, Vincent Barat <vincent.ba...@ubikod.com>
> >> wrote:
> >>>
> >>> Unfortunately I can post only some snapshots.
> >>>
> >>> I have no region split (I insert just 100000 rows, so there is no split,
> >>> except when I don't use compression).
> >>>
> >>> I use HBase 0.20.2, and to insert I use HTable.put(List<Put>).
> >>>
> >>> The only difference between my 3 tests is the way I create the test
> >>> table:
> >>>
> >>> HBaseAdmin admin = new HBaseAdmin(config);
> >>>
> >>> HTableDescriptor desc = new HTableDescriptor(name);
> >>>
> >>> HColumnDescriptor colDesc;
> >>>
> >>> colDesc = new HColumnDescriptor(Bytes.toBytes("meta:"));
> >>> colDesc.setMaxVersions(1);
> >>> colDesc.setCompressionType(Algorithm.GZ); // or Algorithm.LZO / Algorithm.NONE
> >>> desc.addFamily(colDesc);
> >>>
> >>> colDesc = new HColumnDescriptor(Bytes.toBytes("data:"));
> >>> colDesc.setMaxVersions(1);
> >>> colDesc.setCompressionType(Algorithm.GZ); // or Algorithm.LZO / Algorithm.NONE
> >>> desc.addFamily(colDesc);
> >>>
> >>> admin.createTable(desc);
> >>>
> >>> A typical row inserted is made of 13 columns with short content, as shown
> >>> here (all cells share timestamp=1267006115356):
> >>>
> >>> row key: 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730
> >>> data:accuracy    = 1317
> >>> data:alt         = 0
> >>> data:country     = France
> >>> data:countrycode = FR
> >>> data:lat         = 48.65869706
> >>> data:locality    = Morsang-sur-Orge
> >>> data:lon         = 2.36138182
> >>> data:postalcode  = 91390
> >>> data:region      = Ile-de-France
> >>> meta:imei        = 6ffc3fe659023a3c9cfed0a50a9f199ed42f2730
> >>> meta:infoid      = ca30781e0c375a1236afbf323cbfa40dc2c7c7af
> >>> meta:locid       = 5e15a0281e83cfe55ec1c362f84a39f006f18128
> >>> meta:timestamp   = 1264761195240
> >>>
> >>> Maybe LZO works much better with fewer rows with bigger content?
> >>>
> >>> On 24/02/10 19:10, Jean-Daniel Cryans wrote:
> >>>>
> >>>> Are you able to post the code used for the insertion? It could be
> >>>> something with your usage pattern or something wrong with the code
> >>>> itself.
> >>>>
> >>>> How many rows are you inserting? Do you even have some region splits?
> >>>>
> >>>> J-D
> >>>>
> >>>> On Wed, Feb 24, 2010 at 1:52 AM, Vincent Barat <vincent.ba...@ubikod.com>
> >>>> wrote:
> >>>>>
> >>>>> Yes of course.
> >>>>>
> >>>>> We use a 4-machine cluster (4 large instances on AWS): 8 GB RAM each,
> >>>>> dual-core CPU. 1 is for the Hadoop and HBase namenode / masters, and 3 are
> >>>>> hosting the datanodes / regionservers.
> >>>>>
> >>>>> The table used for testing is first created, then I insert a set of rows
> >>>>> sequentially and count the number of rows inserted per second.
> >>>>>
> >>>>> I insert rows in sets of 1000 (using HTable.put(List<Put>)).
> >>>>>
> >>>>> When reading, I also read sequentially, using a scanner (scanner caching
> >>>>> is set to 1024 rows).
> >>>>>
> >>>>> Maybe our installation of LZO is not good?
> >>>>>
> >>>>> On 23/02/10 22:15, Jean-Daniel Cryans wrote:
> >>>>>>
> >>>>>> Vincent,
> >>>>>>
> >>>>>> I don't expect that either; can you give us more info about your test
> >>>>>> environment?
> >>>>>>
> >>>>>> Thx,
> >>>>>>
> >>>>>> J-D
> >>>>>>
> >>>>>> On Tue, Feb 23, 2010 at 10:39 AM, Vincent Barat
> >>>>>> <vincent.ba...@ubikod.com> wrote:
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> I did some testing to figure out which compression algo I should use
> >>>>>>> for my HBase tables. I thought that LZO was the best candidate, but it
> >>>>>>> appears that it is the worst one.
> >>>>>>>
> >>>>>>> I use one table with 2 families and 10 columns. Each row has a total
> >>>>>>> of 200 to 400 bytes.
> >>>>>>>
> >>>>>>> Here are my results:
> >>>>>>>
> >>>>>>> GZIP:           2600 to 3200 inserts/s   12000 to 15000 reads/s
> >>>>>>> NO COMPRESSION: 2000 to 2600 inserts/s    4900 to 5020 reads/s
> >>>>>>> LZO:            1600 to 2100 inserts/s    4020 to 4600 reads/s
> >>>>>>>
> >>>>>>> Do you have an explanation for this? I thought that LZO compression
> >>>>>>> was always faster at compression and decompression than GZIP?
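PS: this is the kind of PE subclass I had in mind for the compression option. Completely untested -- the constructor signature, the visibility of getTableDescriptor() and the Compression.Algorithm import path are assumptions from my reading of the 0.20.x source, so treat it as a sketch rather than working code:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.PerformanceEvaluation;
import org.apache.hadoop.hbase.io.hfile.Compression.Algorithm;

// Reuses PE's table layout but switches every family to LZO (or GZ / NONE).
public class CompressedPE extends PerformanceEvaluation {

  public CompressedPE(HBaseConfiguration conf) {
    super(conf); // constructor signature assumed from the 0.20.x PE
  }

  // Assumes getTableDescriptor() is overridable and returns the descriptor
  // PE uses to create its test table -- check the exact signature in your tree.
  @Override
  protected HTableDescriptor getTableDescriptor() {
    HTableDescriptor desc = super.getTableDescriptor();
    for (HColumnDescriptor family : desc.getFamilies()) {
      family.setCompressionType(Algorithm.LZO); // or Algorithm.GZ / Algorithm.NONE
    }
    return desc;
  }
  // Run it with the same commands as the stock PE (sequentialWrite, randomRead, etc.),
  // just driving this class instead of PerformanceEvaluation itself.
}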
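And for reference, this is roughly how I read Vincent's write/read pattern (batches of 1000 Puts, then a sequential scan with caching at 1024). It's paraphrased from his description rather than his actual code, and the table name, family layout and row keys are made up:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class CompressionBench {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "compression_test");

    // Write: send a batch of 1000 Puts at a time and time each batch.
    List<Put> batch = new ArrayList<Put>(1000);
    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("data"), Bytes.toBytes("lat"), Bytes.toBytes("48.65869706"));
      put.add(Bytes.toBytes("meta"), Bytes.toBytes("timestamp"),
          Bytes.toBytes(Long.toString(System.currentTimeMillis())));
      batch.add(put);
      if (batch.size() == 1000) {
        long start = System.currentTimeMillis();
        table.put(batch); // HTable.put(List<Put>)
        System.out.println("1000 puts in " + (System.currentTimeMillis() - start) + " ms");
        batch.clear();
      }
    }

    // Read: sequential scan with scanner caching set to 1024 rows.
    Scan scan = new Scan();
    scan.setCaching(1024);
    ResultScanner scanner = table.getScanner(scan);
    long start = System.currentTimeMillis();
    int rows = 0;
    for (Result r : scanner) {
      rows++;
    }
    scanner.close();
    System.out.println(rows + " rows read in " + (System.currentTimeMillis() - start) + " ms");
  }
}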