On 5/1/20 11:09 PM, John Mount wrote:
Perhaps use the digest package? Isn't "R the R packages?"

I think it is clear that I am aware of the digest package and of other packages with similar functionality, e.g. the fastdigest package. (And I actually do use digest, as I guess 99% of R developers do, at least as an indirect dependency.) The point is that a) digest is a wonderful and very stable package, but it is still a user-contributed package, whereas b) 'tools' is a base package which is included by default in all R installations, c) tools::md5sum already exists, with almost all the building blocks needed to extend it to calculate MD5 hashes of R objects, and d) there is high demand in the R community for being able to calculate hashes.

So yes, if one wants to use all the utilities or the various algorithms that the digest package provides, one should install and load it. But if one can live with MD5 hashes, why not use the built-in R function? (Well, ideally without having to serialize the object to a file, call tools::md5sum, and then clean up the file.)
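For reference, the file-based workaround mentioned in the parenthesis can be sketched as follows. This is only a minimal illustration using functions that exist today; the helper name md5_object is hypothetical and not part of 'tools' or any package. Note also that the serialized bytes include a header recording the R version that wrote them, so such a hash should not be assumed to be stable across R versions.

```r
# Hypothetical helper (not part of base R or 'tools'): MD5 of an
# in-memory object via a temporary file.
md5_object <- function(x) {
  tmp <- tempfile()
  # Clean up the temporary file even if an error occurs.
  on.exit(unlink(tmp), add = TRUE)
  # Write the serialized bytes of x to the temporary file.
  writeBin(serialize(x, NULL), tmp)
  # tools::md5sum() returns a character vector named by file path;
  # drop the name so only the hash string is returned.
  unname(tools::md5sum(tmp))
}

x <- runif(10)
md5_object(x)
```

The write, hash, and unlink round trip to disk is exactly the overhead that hashing directly from a buffer or a connection would avoid.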


On May 1, 2020, at 2:00 PM, Dénes Tóth <toth.de...@kogentum.hu> wrote:


AFAIK there is no hashing utility in base R which can create hash digests of arbitrary R objects. However, as also described by Henrik Bengtsson in [1], we have tools::md5sum(), which calculates MD5 hashes of files. Calculating hashes of in-memory objects is a very common task in several areas, as demonstrated by the popularity of the 'digest' package (~850,000 downloads/month).

Upon inspection of the relevant files in the R source (e.g., [2] and [3]), it seems that all the building blocks have already been implemented, so hashing need not be restricted to files. I would like to ask:

1) Why is md5_buffer unused?
In src/library/tools/src/md5.c [see 2], md5_buffer is implemented, which seems to be the counterpart of md5_stream for non-file inputs:

---
#ifdef UNUSED
/* Compute MD5 message digest for LEN bytes beginning at BUFFER.  The
  result is always in little endian byte order, so that a byte-wise
  output yields to the wanted ASCII representation of the message
  digest.  */
static void *
md5_buffer (const char *buffer, size_t len, void *resblock)
{
 struct md5_ctx ctx;

 /* Initialize the computation context.  */
 md5_init_ctx (&ctx);

 /* Process whole buffer but last len % 64 bytes.  */
 md5_process_bytes (buffer, len, &ctx);

 /* Put result in desired memory area.  */
 return md5_finish_ctx (&ctx, resblock);
}
#endif
---

2) How can the R community help so that this feature becomes available in package 'tools'?

Suggestions:
As a first step, it would be great if tools::md5sum would support connections (credit goes to Henrik for the idea). E.g., instead of the signature tools::md5sum(files), we could have tools::md5sum(files, conn = NULL), which would allow:

x <- runif(10)
tools::md5sum(conn = rawConnection(serialize(x, NULL)))

To avoid the inconsistency between 'files' (which computes the hash digests in a vectorized manner, that is, one for each file) and 'conn' (which expects a single connection), and to make it easier to extend the hashing for other algorithms without changing the main R interface, a more involved solution would be to introduce tools::hash and tools::hashes, in a similar vein to digest::digest and digest::getVDigest.
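To make the two shapes of interface concrete, here is a rough sketch that emulates them with what is available today. The names md5_conn and md5_objects are hypothetical and are not existing functions in 'tools'. The sketch still goes through a temporary file internally, which is precisely what a native implementation built on md5_buffer or md5_process_bytes could avoid.

```r
# Hypothetical emulation of hashing a single, already open, binary
# connection. Not an existing 'tools' function.
md5_conn <- function(conn, chunk_size = 65536L) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  out <- file(tmp, "wb")
  # Copy the connection's bytes to the temporary file chunk by chunk.
  repeat {
    chunk <- readBin(conn, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    writeBin(chunk, out)
  }
  # Close before hashing so that all bytes are flushed to disk.
  close(out)
  unname(tools::md5sum(tmp))
}

# Hypothetical vectorized counterpart: one hash per element of a list.
md5_objects <- function(xs) {
  vapply(xs, function(x) {
    conn <- rawConnection(serialize(x, NULL))
    on.exit(close(conn), add = TRUE)
    md5_conn(conn)
  }, character(1))
}

x <- runif(10)
conn <- rawConnection(serialize(x, NULL))
md5_conn(conn)
close(conn)

md5_objects(list(x, letters, mtcars))
```

Keeping the single-input and the vectorized function separate mirrors the split between digest::digest and digest::getVDigest mentioned above.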

Regards,
Denes


[1]: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/21
[2]: https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/tools/src/md5.c#L172
[3]: https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/tools/src/Rmd5.c#L27

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

---------------
John Mount
http://www.win-vector.com/
Our book: Practical Data Science with R
http://practicaldatascience.com





