On Thu, Jan 15, 2015 at 11:08 AM, Simon Urbanek <simon.urba...@r-project.org> wrote:
> In addition to the major points that others made: if you care about speed,
> don't use compression. With today's fast disks it's an order of magnitude
> slower to use compression:
>
>> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
>> system.time(saveRDS(d, file="test.rds.gz"))
>    user  system elapsed
>  17.210   0.148  17.397
>> system.time(saveRDS(d, file="test.rds", compress=F))
>    user  system elapsed
>   0.482   0.355   0.929
>
> The above example is intentionally well compressible, in real life the
> differences are actually even bigger. As people that deal with big data
> know well, disks are no longer the bottleneck - it's the CPU now.
Respectfully, while your example would imply this, I don't think this is
correct in the general case. Much faster compression schemes exist, and using
these can improve disk I/O tremendously. Some schemes are so fast that it's
quicker to move compressed data from main RAM to the CPU cache and decompress
it there than to move the uncompressed data and be limited by RAM bandwidth:
https://github.com/Blosc/c-blosc

Repeating that for emphasis: compressing and uncompressing can actually be
faster than a straight memcpy()!

Really, the issue is that 'gzip' and 'bzip2' are bottlenecks. As Stewart
suggests, this can be mitigated by throwing more cores at the problem. This
isn't a bad solution, as there are often excess underutilized cores. But it
would be much better to choose a faster compression scheme first, and then
parallelize that across cores if still necessary. Sometimes the tradeoff is
between amount of compression and speed, and sometimes some algorithms are
just faster than others.

Here's some sample data for the test file that your example creates:

> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed
  0.554   0.336   0.890

nate@ubuntu:~/R/rds$ ls -hs test.rds
382M test.rds

nate@ubuntu:~/R/rds$ time gzip -c test.rds > test.rds.gz
real: 16.207 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.330 sec

nate@ubuntu:~/R/rds$ time gzip -c --fast test.rds > test.rds.gz
real: 4.759 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
56M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.942 sec

nate@ubuntu:~/R/rds$ time pigz -c test.rds > test.rds.gz
real: 2.180 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.375 sec

nate@ubuntu:~/R/rds$ time pigz -c --fast test.rds > test.rds.gz
real: 0.739 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
57M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.851 sec

nate@ubuntu:~/R/rds$ time lz4c test.rds > test.rds.lz4
Compressed 400000102 bytes into 125584749 bytes ==> 31.40%
real: 1.024 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.lz4
120M test.rds.lz4
nate@ubuntu:~/R/rds$ time lz4 test.rds.lz4 > discard
Compressed 125584749 bytes into 95430573 bytes ==> 75.99%
real: 0.775 sec

Reading that last one more closely: with single-threaded lz4 compression,
we're getting 3x compression at about 400MB/s, and decompression at about
500MB/s. This is faster than almost any single disk will be. Multithreaded
implementations will make even the fastest RAID the bottleneck.

It's probably worth noting that the speeds reported in your simple example
for the uncompressed case are likely the speed of writing to memory (the OS
page cache), with the actual write to disk happening at some later time.
Sustained throughput will likely be slower than your example would imply.

If saving data to disk is a bottleneck, I think Stewart is right that there
is a lot of room for improvement.

--nate
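P.S. For anyone who wants to experiment with a different compressor today,
without any changes to R itself: saveRDS() and readRDS() both accept a
connection, so the serialized stream can be piped through an external tool.
A rough sketch of the idea (untested here; it assumes pigz is installed and
on the PATH, and note that saveRDS() ignores its compress argument when
given a connection):

d <- lapply(1:10, function(x) as.integer(rnorm(1e7)))

## write: hand saveRDS() a pipe to a parallel gzip (pigz), so the
## compression runs outside R on multiple cores
con <- pipe("pigz --fast > test.rds.gz", "wb")
saveRDS(d, file = con)
close(con)

## read back: decompress on the fly through another pipe
con <- pipe("pigz -dc test.rds.gz", "rb")
d2 <- readRDS(con)
close(con)

Swapping in another command-line compressor is just a matter of changing the
command strings.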