On Wed, 7 Mar 2007, Henrik Bengtsson wrote:

> To follow up, I went ahead and generated "random" objects to scan for a
> common header for a given R version, and it seems that at most
> the first 18 bytes are non-data specific, which could be the length of
> the serialization header.
>
> Here is my code for this:
>
> scanSerialize <- function(object, hdr=NULL, ...) {
>   # Serialize object
>   raw <- serialize(object, connection=NULL, ascii=TRUE);
>
>   # First run?
>   if (is.null(hdr))
>     return(raw);
>
>   # Find differences between current longest header and new raw vector
>   n <- length(hdr);
>   diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
>
>   # No differences?
>   if (!any(diffs))
>     return(hdr);
>
>   # Position of first difference
>   idx <- which(diffs)[1];
>
>   # Keep common header
>   hdr <- hdr[seq_len(idx-1)];
>
>   hdr;
> };
>
> # Serialize a first "random" object
> hdr <- scanSerialize(NA);
> for (kk in 1:100)
>   hdr <- scanSerialize(kk, hdr=hdr);
> for (kk in 1:100) {
>   # Character vector of random length between 1 and 100
>   x <- sample(letters, size=sample(100, size=1), replace=TRUE);
>   hdr <- scanSerialize(x, hdr=hdr);
> }
> for (kk in 1:100) {
>   hdr <- scanSerialize(kk, hdr=hdr);
>   hdr <- scanSerialize(hdr, hdr=hdr);
> }
>
> cat("Length:", length(hdr), "\n");
> print(hdr);
> print(rawToChar(hdr));
>
> On R v2.5.0 devel, this gives:
> Length: 18
> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
> [1] "A\n2\n132352\n131840\n"
>
> However, it would still be good to get an "official" statement from
> one in the R-code team about the serialization header and where the
> data section starts.  Again, I want to cut out as much as possible for
> consistency between R versions without losing data dependent bytes.

An official, and definitive, statement from the _R-core_ team has been
available to you all along at

     https://svn.r-project.org/R/trunk/src/main/serialize.c

My unofficial and non-definitive interpretation of that statement is
that there is a header of four items,

     Format code 'A' or 'X' ('B' also possible in older formats)
     Version number of the format
     Packed integer containing the R version that did the serializing
     Packed integer containing the oldest R version that can read the format

You can see this if you look at the ascii version as text:

     > serialize(1, stdout(), ascii=TRUE)
     A
     2
     132097
     131840
     14
     1
     1
     NULL
     > serialize(as.integer(1), stdout(), ascii=TRUE)
     A
     2
     132097
     131840
     13
     1
     1
     NULL

In the non-ascii 'X' (as in xdr) format this will constitute 14 bytes
(the two bytes "X\n" followed by three 4-byte integers).  In ascii
format I believe it is currently 18 bytes but this could change with
the version number of R -- I'd have to read the official and definitive
statement to see how the integer packing is done and work out whether
that could change the number of bytes.  The number of bytes would also
change if we reached format version 10, but something about the format
would also change of course.  A safer way to look at the header in the
ascii version is as the first four lines.

Best,

luke

>
> Thanks
>
> /Henrik
>
> On 3/7/07, Henrik Bengtsson <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I noticed that serialize() gives different results depending on R
>> version, which has implications for the digest() function in the digest
>> package.  Note, it does give the same output across platforms.  I know
>> that serialize() is under development, but is this expected, e.g. is
>> there some kind of header in the result that specifies "who" generated
>> the stream, and if so, exactly what bytes are they?
>>
>> SETUP:
>>
>> R versions:
>> A) R v2.4.0 (2006-10-03)
>> B) R v2.4.1pat (2007-01-13 r40470)
>> C) R v2.5.0dev (2006-12-12 r40167)
>>
>> This is on WinXP and I start R with Rterm --vanilla.
>>
>> Example: Identical serialize() calls using the different R versions.
>>
>> > raw <- serialize(1, connection=NULL, ascii=TRUE)
>> > print(raw)
>>
>> gives:
>>
>> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>>
>> Note the difference in raw bytes 8 to 10, i.e.
>>
>> > raw[7:11]
>> (A): [1] 32 30 39 36 0a
>> (B): [1] 32 30 39 37 0a
>> (C): [1] 32 33 35 32 0a
>>
>> Do bytes 8, 9 and 10 in the raw vector somehow contain information
>> about the R version or similar?  The following poor man's test says
>> that is the only difference:
>>
>> On all R versions, the following gives identical results:
>>
>> > raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
>> > raw <- as.integer(raw[-c(8:10)])
>> > sum(raw)
>> [1] 2147884
>> > sum(log(raw))
>> [1] 177201.2
>>
>> If it is true that there is an R-version-specific header in serialized
>> objects, then the digest() function should exclude such a header in
>> order to produce consistent results across R versions, because now
>> digest(1) gives different results.
>>
>> Thank you
>>
>> Henrik
>>
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      [EMAIL PROTECTED]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
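
The four-item header described in the reply above can be illustrated with a minimal sketch. The helper names decodeVersion() and splitAsciiHeader() are hypothetical and not part of R. The packing formula major * 65536 + minor * 256 + patch is an assumption here, although it agrees with the values quoted in this thread (132096 for 2.4.0, 132097 for 2.4.1, 132352 for 2.5.0). The call passes version = 2, which assumes a current R whose serialize() accepts that argument, because newer releases default to a later format with extra header fields.

```r
# Decode a packed version integer, ASSUMING the layout
# major * 65536 + minor * 256 + patch.
decodeVersion <- function(packed) {
  packed <- as.integer(packed)
  c(major = packed %/% 65536L,
    minor = (packed %% 65536L) %/% 256L,
    patch = packed %% 256L)
}

# Split an ASCII serialization into its header (the first 'nlines'
# newline-terminated lines) and the remaining data bytes.
splitAsciiHeader <- function(raw, nlines = 4L) {
  nl <- which(raw == as.raw(0x0a))   # positions of "\n" bytes
  stopifnot(length(nl) >= nlines)
  end <- nl[nlines]                  # last byte of the header
  list(header = raw[seq_len(end)],
       data   = raw[-seq_len(end)])
}

# Format version 2 is requested explicitly to match the thread.
raw <- serialize(1, connection = NULL, ascii = TRUE, version = 2)
parts <- splitAsciiHeader(raw)

# The four header fields as text: format code, format version,
# writer version (packed), oldest reader version (packed).
fields <- strsplit(rawToChar(parts$header), "\n", fixed = TRUE)[[1]]

decodeVersion(132097)   # major 2, minor 4, patch 1
decodeVersion(131840)   # major 2, minor 3, patch 0
```

Splitting on newlines rather than on a fixed byte count follows the "first four lines" suggestion, so the split should not depend on how many digits the packed integers happen to have. Under that reading, parts$data is the portion one would hash to compare objects across R versions.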