My (very rough) calculation of the data size came out at around 50MB: 400 bytes * 100,000 for the values, 32 + 8 * 13 * 100,000 for the keys, and an extra meg or two of key overhead. I don't see how that would have resulted in a region split, so I assume we are still missing some information (or I made a mistake). As you mention, it should mean that everything is still in the MemStore and compression has not come into play yet. Puzzling...
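For what it's worth, here is that back-of-envelope arithmetic spelled out (nothing measured, just the same rough numbers as above; the 8 bytes per cell of key overhead is my own guess):

public class SizeEstimate {
  public static void main(String[] args) {
    long values = 400L * 100000;          // ~40 MB of cell values
    long keys = 32 + 8L * 13 * 100000;    // ~10 MB of per-cell key overhead (assumed)
    long extra = 2L * 1024 * 1024;        // "an extra meg or two" of key stuff
    System.out.println((values + keys + extra) / (1024 * 1024) + " MB"); // prints 50 MB
  }
}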
On PE: there isn't currently a way to specify compression options on the testtable without extending PE and overriding the org.apache.hadoop.hbase.PerformanceEvaluation#getTableDescriptor method (I've put a rough, untested sketch of what I mean at the bottom of this mail). Maybe it could be added as an option?

Cheers,
Dan

On 1 March 2010 10:56, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> As Dan said, your data is so small you don't really trigger many
> different behaviors in HBase; it could very well be kept mostly in the
> memstores, where compression has no impact at all.
>
> WRT a benchmark, there's the PerformanceEvaluation (we call it PE for
> short), which is well maintained and lets you set a compression level.
> This page has an outdated help but it shows you how to run it:
> http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
>
> Another option is importing the wikipedia dump, which is highly
> compressible and not manufactured like the PE. Last summer I wrote a
> small MR job to do the import easily, and although the code is based on
> a dev version of 0.20.0, it should be fairly easy to make it work on
> 0.20.3 (probably just replacing the libs). See
> http://code.google.com/p/hbase-wikipedia-loader/
>
> See the last paragraph of the Getting Started section in the Wiki, where
> I show some import numbers:
>
> "For example, it took 29 min on a 6-node cluster (1 master and 5
> region servers) with the same hardware (AMD Phenom(tm) 9550 Quad, 8GB,
> 2x1TB disks), 2 map slots per task tracker (that's 10 parallel maps),
> and GZ compression. With LZO and a new table it took 23 min 20 sec.
> Compressed, the table is 32 regions big; uncompressed it's 93 and took
> 30 min 10 sec to import."
>
> You can see that the import was a lot faster on LZO. I didn't do any
> reading test tho...
>
> Good luck!
>
> J-D
>
> On Sun, Feb 28, 2010 at 9:30 AM, Vincent Barat <vincent.ba...@ubikod.com> wrote:
> > The impact of my cluster architecture on the performance is obviously the
> > same in my 3 test cases. Provided that I only change the compression type
> > between tests, I don't understand why changing the number of regions or
> > whatever else would change the speed ratio between my tests, especially
> > between the GZIP & LZO tests.
> >
> > Is there some ready-to-use and easy-to-set-up benchmark I could use to try
> > to reproduce the issue in a well-known environment?
> >
> > On 25/02/10 19:29, Jean-Daniel Cryans wrote:
> >>
> >> If only 1 region, providing more than one node will probably just
> >> slow down the test, since the load is handled by one machine which has
> >> to replicate blocks 2 times. I think your test would have much more
> >> value if you really grew to at least 10 regions. Also make sure to run
> >> the tests more than once on completely new HBase setups (drop table +
> >> restart should be enough).
> >>
> >> May I also recommend upgrading to HBase 0.20.3? It will provide a
> >> better experience in general.
> >>
> >> J-D
> >>
> >> On Thu, Feb 25, 2010 at 2:49 AM, Vincent Barat <vincent.ba...@ubikod.com>
> >> wrote:
> >>>
> >>> Unfortunately I can post only some snapshots.
> >>>
> >>> I have no region split (I insert just 100000 rows, so there is no split,
> >>> except when I don't use compression).
> >>>
> >>> I use HBase 0.20.2, and to insert I use HTable.put(List<Put>).
> >>>
> >>> The only difference between my 3 tests is the way I create the test
> >>> table:
> >>>
> >>> HBaseAdmin admin = new HBaseAdmin(config);
> >>>
> >>> HTableDescriptor desc = new HTableDescriptor(name);
> >>>
> >>> HColumnDescriptor colDesc;
> >>>
> >>> colDesc = new HColumnDescriptor(Bytes.toBytes("meta:"));
> >>> colDesc.setMaxVersions(1);
> >>> colDesc.setCompressionType(Algorithm.GZ); // or Algorithm.LZO / Algorithm.NONE
> >>> desc.addFamily(colDesc);
> >>>
> >>> colDesc = new HColumnDescriptor(Bytes.toBytes("data:"));
> >>> colDesc.setMaxVersions(1);
> >>> colDesc.setCompressionType(Algorithm.GZ); // or Algorithm.LZO / Algorithm.NONE
> >>> desc.addFamily(colDesc);
> >>>
> >>> admin.createTable(desc);
> >>>
> >>> A typical row inserted is made of 13 columns with short content, as shown
> >>> here (all cells share timestamp=1267006115356):
> >>>
> >>> row key: 1264761195240/6ffc3fe659023a3c9cfed0a50a9f199ed42f2730
> >>> data:accuracy    = 1317
> >>> data:alt         = 0
> >>> data:country     = France
> >>> data:countrycode = FR
> >>> data:lat         = 48.65869706
> >>> data:locality    = Morsang-sur-Orge
> >>> data:lon         = 2.36138182
> >>> data:postalcode  = 91390
> >>> data:region      = Ile-de-France
> >>> meta:imei        = 6ffc3fe659023a3c9cfed0a50a9f199ed42f2730
> >>> meta:infoid      = ca30781e0c375a1236afbf323cbfa40dc2c7c7af
> >>> meta:locid       = 5e15a0281e83cfe55ec1c362f84a39f006f18128
> >>> meta:timestamp   = 1264761195240
> >>>
> >>> Maybe LZO works much better with fewer rows with bigger content?
> >>>
> >>> On 24/02/10 19:10, Jean-Daniel Cryans wrote:
> >>>>
> >>>> Are you able to post the code used for the insertion? It could be
> >>>> something with your usage pattern or something wrong with the code
> >>>> itself.
> >>>>
> >>>> How many rows are you inserting? Do you even have some region splits?
> >>>>
> >>>> J-D
> >>>>
> >>>> On Wed, Feb 24, 2010 at 1:52 AM, Vincent Barat <vincent.ba...@ubikod.com>
> >>>> wrote:
> >>>>>
> >>>>> Yes of course.
> >>>>>
> >>>>> We use a 4-machine cluster (4 large instances on AWS): 8 GB RAM each,
> >>>>> dual-core CPU. 1 is for the Hadoop and HBase namenode / masters, and 3 are
> >>>>> hosting the datanodes / regionservers.
> >>>>>
> >>>>> The table used for testing is first created, then I insert a set of rows
> >>>>> sequentially and count the number of rows inserted per second.
> >>>>>
> >>>>> I insert rows in sets of 1000 (using HTable.put(List<Put>)).
> >>>>>
> >>>>> When reading, I also read sequentially, using a scanner (scanner caching
> >>>>> is set to 1024 rows).
> >>>>>
> >>>>> Maybe our installation of LZO is not good?
> >>>>>
> >>>>> On 23/02/10 22:15, Jean-Daniel Cryans wrote:
> >>>>>>
> >>>>>> Vincent,
> >>>>>>
> >>>>>> I don't expect that either; can you give us more info about your test
> >>>>>> environment?
> >>>>>>
> >>>>>> Thx,
> >>>>>>
> >>>>>> J-D
> >>>>>>
> >>>>>> On Tue, Feb 23, 2010 at 10:39 AM, Vincent Barat
> >>>>>> <vincent.ba...@ubikod.com> wrote:
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> I did some testing to figure out which compression algo I should use
> >>>>>>> for my HBase tables. I thought that LZO was the best candidate, but it
> >>>>>>> appears that it is the worst one.
> >>>>>>>
> >>>>>>> I use one table with 2 families and 10 columns. Each row has a total
> >>>>>>> of 200 to 400 bytes.
> >>>>>>>
> >>>>>>> Here are my results:
> >>>>>>>
> >>>>>>> GZIP:           2600 to 3200 inserts/s   12000 to 15000 reads/s
> >>>>>>> NO COMPRESSION: 2000 to 2600 inserts/s    4900 to 5020 reads/s
> >>>>>>> LZO:            1600 to 2100 inserts/s    4020 to 4600 reads/s
> >>>>>>>
> >>>>>>> Do you have an explanation for this? I thought that LZO compression
> >>>>>>> was always faster at compression and decompression than GZIP?
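PS: this is the kind of PE subclass I had in mind for the compression option. Completely untested -- the constructor signature, the visibility of getTableDescriptor() and the Compression.Algorithm import path are assumptions from my reading of the 0.20.x source, so treat it as a sketch rather than working code:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.PerformanceEvaluation;
import org.apache.hadoop.hbase.io.hfile.Compression.Algorithm;

// Reuses PE's table layout but switches every family to LZO (or GZ / NONE).
public class CompressedPE extends PerformanceEvaluation {

  public CompressedPE(HBaseConfiguration conf) {
    super(conf); // constructor signature assumed from the 0.20.x PE
  }

  // Assumes getTableDescriptor() is overridable and returns the descriptor
  // PE uses to create its test table -- check the exact signature in your tree.
  @Override
  protected HTableDescriptor getTableDescriptor() {
    HTableDescriptor desc = super.getTableDescriptor();
    for (HColumnDescriptor family : desc.getFamilies()) {
      family.setCompressionType(Algorithm.LZO); // or Algorithm.GZ / Algorithm.NONE
    }
    return desc;
  }
  // Run it with the same commands as the stock PE (sequentialWrite, randomRead, etc.),
  // just driving this class instead of PerformanceEvaluation itself.
}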
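And for reference, this is roughly how I read Vincent's write/read pattern (batches of 1000 Puts, then a sequential scan with caching at 1024). It's paraphrased from his description rather than his actual code, and the table name, family layout and row keys are made up:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class CompressionBench {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "compression_test");

    // Write: send a batch of 1000 Puts at a time and time each batch.
    List<Put> batch = new ArrayList<Put>(1000);
    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("data"), Bytes.toBytes("lat"), Bytes.toBytes("48.65869706"));
      put.add(Bytes.toBytes("meta"), Bytes.toBytes("timestamp"),
          Bytes.toBytes(Long.toString(System.currentTimeMillis())));
      batch.add(put);
      if (batch.size() == 1000) {
        long start = System.currentTimeMillis();
        table.put(batch); // HTable.put(List<Put>)
        System.out.println("1000 puts in " + (System.currentTimeMillis() - start) + " ms");
        batch.clear();
      }
    }

    // Read: sequential scan with scanner caching set to 1024 rows.
    Scan scan = new Scan();
    scan.setCaching(1024);
    ResultScanner scanner = table.getScanner(scan);
    long start = System.currentTimeMillis();
    int rows = 0;
    for (Result r : scanner) {
      rows++;
    }
    scanner.close();
    System.out.println(rows + " rows read in " + (System.currentTimeMillis() - start) + " ms");
  }
}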