To follow up, I went ahead and generated "random" objects to scan for a common header for a given R version. It seems that at most the first 18 bytes are not data specific, which could be the length of the serialization header.
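
As a rough sketch of what those 18 bytes might encode, they can be split at the newlines and decoded. This assumes the field layout described in the R Internals manual (format, serialization version, R version that wrote the stream, minimum R version able to read it) and that versions are packed as major*65536 + minor*256 + patch. decodeVersion() is a hypothetical helper, not part of R, and the bytes are copied from the R v2.5.0 devel output in this message rather than regenerated.

```r
# The 18 common bytes reported for R v2.5.0 devel in this message
hdr <- as.raw(c(0x41, 0x0a, 0x32, 0x0a,
                0x31, 0x33, 0x32, 0x33, 0x35, 0x32, 0x0a,
                0x31, 0x33, 0x31, 0x38, 0x34, 0x30, 0x0a))

# Split the ASCII header into its newline-separated fields
fields <- strsplit(rawToChar(hdr), "\n", fixed=TRUE)[[1]]

# Hypothetical helper: unpack major*65536 + minor*256 + patch
# (assumed packing; see the lead-in)
decodeVersion <- function(v) {
  v <- as.integer(v)
  paste(v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L, sep=".")
}

writer <- decodeVersion(fields[3])     # field assumed to be the writing R version
minReader <- decodeVersion(fields[4])  # field assumed to be the minimum reader version

cat("Format:", fields[1], " Serialization version:", fields[2], "\n")
cat("Written by R:", writer, " Readable from R:", minReader, "\n")
```

Under the same assumption, the values 132096 and 132097 seen with the other two R versions decode to 2.4.0 and 2.4.1, which would be consistent with only bytes 8 to 10 changing between versions.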

Here is the code I used for the scan:

scanSerialize <- function(object, hdr=NULL, ...) {
  # Serialize object
  raw <- serialize(object, connection=NULL, ascii=TRUE);

  # First run?
  if (is.null(hdr))
    return(raw);

  # Find differences between current longest header and new raw vector
  n <- length(hdr);
  diffs <- (as.integer(hdr) != as.integer(raw[seq_len(n)]));

  # No differences?
  if (!any(diffs))
    return(hdr);

  # Position of first difference
  idx <- which(diffs)[1];

  # Keep common header
  hdr <- hdr[seq_len(idx-1)];

  hdr;
};

# Serialize a first "random" object
hdr <- scanSerialize(NA);

# Integers
for (kk in 1:100)
  hdr <- scanSerialize(kk, hdr=hdr);

# Character vectors of random length (1 to 100)
for (kk in 1:100) {
  x <- sample(letters, size=sample(100, size=1), replace=TRUE);
  hdr <- scanSerialize(x, hdr=hdr);
}

# Integers and the current header itself (a raw vector)
for (kk in 1:100) {
  hdr <- scanSerialize(kk, hdr=hdr);
  hdr <- scanSerialize(hdr, hdr=hdr);
}

cat("Length:", length(hdr), "\n");
print(hdr);
print(rawToChar(hdr));

On R v2.5.0 devel, this gives:

Length: 18
 [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
[1] "A\n2\n132352\n131840\n"

However, it would still be good to get an "official" statement from someone on the R core team about the serialization header and where the data section starts. Again, I want to cut out as much as possible for consistency across R versions without losing data-dependent bytes.

Thanks

/Henrik

On 3/7/07, Henrik Bengtsson <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I noticed that serialize() gives different results depending on R
> version, which has implications for the digest() function in the digest
> package.  Note, it does give the same output across platforms.  I know
> that serialize() is under development, but is this expected, e.g. is
> there some kind of header in the result that specifies "who" generated
> the stream, and if so, exactly what bytes are they?
>
> SETUP:
>
> R versions:
> A) R v2.4.0 (2006-10-03)
> B) R v2.4.1pat (2007-01-13 r40470)
> C) R v2.5.0dev (2006-12-12 r40167)
>
> This is on WinXP and I start R with Rterm --vanilla.
>
> Example: Identical serialize() calls using the different R versions.
>
> > raw <- serialize(1, connection=NULL, ascii=TRUE)
> > print(raw)
>
> gives:
>
> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
>
> Note the difference in raw bytes 8 to 10, i.e.
>
> > raw[7:11]
> (A): [1] 32 30 39 36 0a
> (B): [1] 32 30 39 37 0a
> (C): [1] 32 33 35 32 0a
>
> Do bytes 8, 9 and 10 in the raw vector somehow contain information
> about the R version or similar?  The following poor man's test says
> that this is the only difference:
>
> On all R versions, the following gives identical results:
>
> > raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
> > raw <- as.integer(raw[-c(8:10)])
> > sum(raw)
> [1] 2147884
> > sum(log(raw))
> [1] 177201.2
>
> If it is true that there is an R-version-specific header in serialized
> objects, then the digest() function should exclude that header in
> order to produce consistent results across R versions, because now
> digest(1) gives different results.
>
> Thank you
>
> Henrik
>
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel