On Thu, Jan 15, 2015 at 11:08 AM, Simon Urbanek <simon.urba...@r-project.org> wrote:
> In addition to the major points that others made: if you care about speed,
> don't use compression. With today's fast disks it's an order of magnitude
> slower to use compression:
>
>> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
>> system.time(saveRDS(d, file="test.rds.gz"))
>    user  system elapsed
>  17.210   0.148  17.397
>> system.time(saveRDS(d, file="test.rds", compress=F))
>    user  system elapsed
>   0.482   0.355   0.929
>
> The above example is intentionally well compressible, in real life the
> differences are actually even bigger. As people that deal with big data
> know well, disks are no longer the bottleneck - it's the CPU now.
Respectfully, while your example would imply this, I don't think this is
correct in the general case. Much faster compression schemes exist, and using
these can improve disk I/O tremendously. Some schemes are so fast that it's
quicker to move compressed data from main RAM to the CPU cache and decompress
it there than to move the uncompressed data and be limited by RAM bandwidth:
https://github.com/Blosc/c-blosc

Repeating that for emphasis: compressing and uncompressing can actually be
faster than a straight memcpy()!

Really, the issue is that 'gzip' and 'bzip2' are bottlenecks. As Stewart
suggests, this can be mitigated by throwing more cores at the problem. This
isn't a bad solution, as there are often excess underutilized cores. But it
would be much better to choose a faster compression scheme first, and then
parallelize that across cores if still necessary. Sometimes the tradeoff is
between amount of compression and speed, and sometimes some algorithms are
just faster than others.

Here's some sample data for the test file that your example creates:

> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed
  0.554   0.336   0.890

nate@ubuntu:~/R/rds$ ls -hs test.rds
382M test.rds

nate@ubuntu:~/R/rds$ time gzip -c test.rds > test.rds.gz
real: 16.207 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.330 sec

nate@ubuntu:~/R/rds$ time gzip -c --fast test.rds > test.rds.gz
real: 4.759 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
56M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.942 sec

nate@ubuntu:~/R/rds$ time pigz -c test.rds > test.rds.gz
real: 2.180 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.375 sec

nate@ubuntu:~/R/rds$ time pigz -c --fast test.rds > test.rds.gz
real: 0.739 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
57M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.851 sec

nate@ubuntu:~/R/rds$ time lz4c test.rds > test.rds.lz4
Compressed 400000102 bytes into 125584749 bytes ==> 31.40%
real: 1.024 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.lz4
120M test.rds.lz4
nate@ubuntu:~/R/rds$ time lz4 test.rds.lz4 > discard
Compressed 125584749 bytes into 95430573 bytes ==> 75.99%
real: 0.775 sec

Reading that last one more closely: with single-threaded lz4 compression,
we're getting 3x compression at about 400MB/s, and decompression at about
500MB/s. This is faster than almost any single disk will be. Multithreaded
implementations will make even the fastest RAID the bottleneck.

It's probably worth noting that the speeds reported in your simple example
for the uncompressed case are likely the speed of writing to memory (the OS
page cache), with the actual write to disk happening at some later time.
Sustained throughput will likely be slower than your example would imply.

If saving data to disk is a bottleneck, I think Stewart is right that there
is a lot of room for improvement.

--nate
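P.S. For anyone who wants to experiment with a different compressor today,
without any changes to R itself: saveRDS() and readRDS() both accept a
connection, so the serialized stream can be piped through an external tool.
A rough sketch of the idea (untested here; it assumes pigz is installed and
on the PATH, and note that saveRDS() ignores its compress argument when
given a connection):

d <- lapply(1:10, function(x) as.integer(rnorm(1e7)))

## write: hand saveRDS() a pipe to a parallel gzip (pigz), so the
## compression runs outside R on multiple cores
con <- pipe("pigz --fast > test.rds.gz", "wb")
saveRDS(d, file = con)
close(con)

## read back: decompress on the fly through another pipe
con <- pipe("pigz -dc test.rds.gz", "rb")
d2 <- readRDS(con)
close(con)

Swapping in another command-line compressor is just a matter of changing the
command strings.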