> On 10/05/2024, at 12:31 PM, Henrik Bengtsson <henrik.bengts...@gmail.com> 
> wrote:
> 
> On Thu, May 9, 2024 at 3:46 PM Simon Urbanek
> <simon.urba...@r-project.org> wrote:
>> 
>> FWIW serialize() is binary so there is no conversion to text:
>> 
>>> serialize(1:10+0L, NULL)
>> [1] 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 
>> 00
>> [26] 00 0d 00 00 00 0a 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 
>> 00
>> [51] 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00 0a
>> 
>> It uses the native representation so it is actually not as bad as it sounds.
>> 
>> One aspect I forgot to mention in the earlier thread is that if you don't 
>> need to exchange the serialized objects between machines with different 
>> endianness, then avoiding the swap makes it faster. E.g., on Intel (which is
>> little-endian and thus needs swapping):
>> 
>>> a=1:1e8/2
>>> system.time(serialize(a, NULL))
>>   user  system elapsed
>>  2.123   0.468   2.661
>>> system.time(serialize(a, NULL, xdr=FALSE))
>>   user  system elapsed
>>  0.393   0.348   0.742
> 
> Would it be worth looking into making xdr=FALSE the default? From
> help("serialize"):
> 
> xdr: a logical: if a binary representation is used, should a
> big-endian one (XDR) be used?
> ...
> As almost all systems in current use are little-endian, xdr = FALSE
> can be used to avoid byte-shuffling at both ends when transferring
> data from one little-endian machine to another (or between processes
> on the same machine). Depending on the system, this can speed up
> serialization and unserialization by a factor of up to 3x.
> 
> This seems like low-hanging fruit that could spare the world from
> wasting unnecessary CPU cycles.
> 


I have thought about it before, but the main problem here is (as often) 
compatibility. The current default guarantees that the output can be safely 
read on any machine, while xdr=FALSE only works between machines with the 
same endianness and will fail horribly otherwise. R cannot really know whether 
the user intends to transport the serialized data to another machine, so it 
cannot assume the native order is safe unless the user indicates so. Therefore 
all we can safely do is tell users to use it where appropriate -- and the 
documentation explicitly says so:

     As almost all systems in current use are little-endian, ‘xdr =
     FALSE’ can be used to avoid byte-shuffling at both ends when
     transferring data from one little-endian machine to another (or
     between processes on the same machine).  Depending on the system,
     this can speed up serialization and unserialization by a factor of
     up to 3x.
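To make the byte-shuffling concrete outside of R: XDR is simply big-endian wire order, so on a little-endian host every element written with the default has to be swapped, while xdr=FALSE writes the in-memory bytes as-is. A minimal Python sketch (illustrative only -- this is not R's serialization format, just the same vector 1:10 packed in both byte orders):

```python
import struct
import sys

# The integer vector 1..10 packed in XDR (big-endian) order versus
# native little-endian order, as on x86/ARM.
values = list(range(1, 11))

xdr = struct.pack(">10i", *values)      # big-endian, like serialize(..., xdr = TRUE)
native = struct.pack("<10i", *values)   # little-endian, like xdr = FALSE on x86/ARM

# Both buffers decode to the same vector; only the byte order differs.
assert struct.unpack(">10i", xdr) == struct.unpack("<10i", native)

# First element (1L) as bytes: 00 00 00 01 in XDR vs 01 00 00 00 native.
print(sys.byteorder, xdr[:4].hex(), native[:4].hex())
```

On a little-endian machine the native buffer is byte-for-byte the in-memory representation, so no per-element swap pass is needed -- that swap is where the ~3x difference in the timings above comes from.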

Unfortunately, no one bothers to read the documentation, so it is not as 
effective as changing the default would be, but for the reasons above the 
default is not easy to change. I do acknowledge that the risk is relatively 
low, since big-endian machines are becoming rare, but it's not zero.

That said, what worries me a bit more is that some derived functions such as 
saveRDS() don't expose the xdr option, so you actually have no way to use the 
native binary format. I understand the logic - see above - but, as you said, 
that makes them unnecessarily slow. I wonder if it may be worth doing something 
a bit smarter and officially tagging a "reverse XDR" format instead - that way 
it would be well-defined and could be made the default. Interestingly, the 
de-serialization side doesn't care: you can already use readRDS() on the 
native-order serialization in current R versions, so just adding the option 
would still be backwards-compatible. Definitely something to think about...
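The reason the reading side can be agnostic is that a tagged format lets the reader detect the writer's byte order and normalize on load. A hypothetical Python sketch of that idea (a one-byte order flag of my own invention, not R's actual serialization header):

```python
import struct

def write_ints(values, big_endian):
    """Write int32s prefixed with a byte-order flag: '>' (big) or '<' (little)."""
    flag = b">" if big_endian else b"<"
    return flag + struct.pack(flag.decode() + "%di" % len(values), *values)

def read_ints(buf):
    """Read the flag, then decode in whatever order the writer used."""
    n = (len(buf) - 1) // 4
    return list(struct.unpack(buf[:1].decode() + "%di" % n, buf[1:]))

# The reader recovers the same vector regardless of the writer's order.
data = [1, 2, 3]
assert read_ints(write_ints(data, True)) == data
assert read_ints(write_ints(data, False)) == data
```

A writer that tags the order this way can default to the cheap native encoding while remaining readable everywhere, which is essentially the "reverse XDR" idea above.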

Cheers,
Simon


______________________________________________
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
