Hi,

> On 16 Sep 2016, at 23:38, hellm...@informatik.uni-leipzig.de wrote:
> 
> do you have a link where we can read up on it? Or an idea how we can test 
> this quickly? Bz2 has the advantage of streamextracting with bzcat. 

You'll find a lot of comparisons online, for example:
http://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
(when comparing times, make sure to compare runs that produce roughly equal 
compressed file sizes)

But maybe it's already enough to point out that the Linux kernel is no longer 
officially distributed as bz2 but as xz (and hasn't been for > 2 years now):
https://www.kernel.org/happy-new-year-and-good-bye-bzip2.html

Wrt. stream-extracting:
xz follows the "command-line interface" of gzip and bzip2, so there is an xzcat 
if you have xz installed.
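
For example (dump.ttl.xz is just a placeholder name here):

  # stream-extract without ever writing the decompressed file to disk
  xzcat dump.ttl.xz | head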

There is also a program called pxz, which parallelizes the compression.

To get both, just run `apt-get install xz-utils pxz` on Debian systems.
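
A minimal sketch of compressing with it (placeholder file name again; as far 
as I know pxz uses all available cores by default, see its man page):

  pxz dump.ttl       # produces dump.ttl.xz, which plain xz / xzcat can read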


> How exactly can working with gzip be faster than uncompressed files?

I agree that this is counter-intuitive, but it becomes clear if you think about 
the typical hardware stack we work on nowadays:
With local HDDs, and even SSDs, the typical IO read speed goes up to about 
300 MB/s; let's be generous and say it's 1 GB/s.
Typical CPU-to-RAM speeds, however, are still an order of magnitude faster 
(typically > 10 GB/s).
Additionally, our computers nowadays have several cores.

For raw (uncompressed) files, this typically results in a 1 GB/s processing 
speed cap, while at least one of your CPU cores sits mostly idle.
So even if you optimize all your algorithms to be very, very fast, your 
hardware's read speed makes it impossible to get any faster.

If you however work on gzipped files, then with NT or TTL you almost certainly 
get compression ratios of 10:1 or better (the compressed file is ≤ 10 % of the 
original size). This means that your 1 GB/s read speed from super fast SSDs is 
upgraded to 10 GB/s: one CPU core spends a bit of its time decompressing the 
1 GB/s stream from disk into a 10 GB/s stream that is now in RAM. You just 
reached a 10x speedup by working straight from compressed files.

Obviously, the above is only really useful for IO-bound processes / streaming. 
I've found it very useful and efficient several times before, working with 
grep, sort, awk, or when importing dump files into some store (e.g., Virtuoso). 
If, however, your processing is CPU-bound (for example, it only consumes data 
at < 10 MB/s), then compression obviously won't speed anything up.
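
As a made-up example of such an IO-bound pipeline working straight from the 
compressed dump (file name and predicate are just placeholders):

  # count triples mentioning a predicate, without ever decompressing to disk
  zcat dump.nt.gz | grep -c 'dbpedia.org/ontology/birthPlace'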

One more thing:
When working with sort, I can also recommend other compression algorithms, less 
well known than gzip, which put even more focus on low-effort (de-)compression 
(e.g., lzop or snappy, AKA zippy). I use lzop with "sort --compress-program 
lzop", for example here:
https://joernhees.de/blog/2015/01/28/dbpedia-2014-stats-top-subjects-predicates-and-objects/
(Even though I use SSD TMP storage, this speeds the whole thing up.)
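
For reference, a simplified sketch of that kind of pipeline (the real commands 
are in the blog post above; the file name is a placeholder):

  # top predicates of a gzipped NT dump, with lzop-compressed sort temp files
  zcat dump.nt.gz | awk '{ print $2 }' \
    | sort --compress-program=lzop | uniq -c | sort -rn | head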

Best,
Jörn

