Hi
Luke Tierney wrote:
> On Wed, 7 Mar 2007, Henrik Bengtsson wrote:
>
>> To follow up, I went ahead and generated "random" objects to scan for a
>> common header for a given R version, and it seems that at most
>> the first 18 bytes are non-data specific, which could be the length of
>> the serialization header.
>>
>> Here is my code for this:
>>
>> scanSerialize <- function(object, hdr=NULL, ...) {
>>   # Serialize object
>>   raw <- serialize(object, connection=NULL, ascii=TRUE);
>>
>>   # First run?
>>   if (is.null(hdr))
>>     return(raw);
>>
>>   # Find differences between current longest header and new raw vector
>>   n <- length(hdr);
>>   diffs <- (as.integer(hdr) != as.integer(raw[seq_len(n)]));
>>
>>   # No differences?
>>   if (!any(diffs))
>>     return(hdr);
>>
>>   # Position of first difference
>>   idx <- which(diffs)[1];
>>
>>   # Keep common header
>>   hdr <- hdr[seq_len(idx-1)];
>>
>>   hdr;
>> };
>>
>> # Serialize a first "random" object
>> hdr <- scanSerialize(NA);
>> for (kk in 1:100)
>>   hdr <- scanSerialize(kk, hdr=hdr);
>> for (kk in 1:100) {
>>   # Character vector of random length between 1 and 100
>>   x <- sample(letters, size=sample(100, size=1), replace=TRUE);
>>   hdr <- scanSerialize(x, hdr=hdr);
>> }
>> for (kk in 1:100) {
>>   hdr <- scanSerialize(kk, hdr=hdr);
>>   hdr <- scanSerialize(hdr, hdr=hdr);
>> }
>>
>> cat("Length:", length(hdr), "\n");
>> print(hdr);
>> print(rawToChar(hdr));
>>
>> On R v2.5.0 devel, this gives:
>> Length: 18
>> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
>> [1] "A\n2\n132352\n131840\n"
>>
>> However, it would still be good to get an "official" statement from
>> one in the R-code team about the serialization header and where the
>> data section starts. Again, I want to cut out as much as possible for
>> consistency between R versions without losing data-dependent bytes.
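The scan quoted above can also be sketched without the running state, by
serializing a handful of objects up front and taking their longest shared
prefix. This is only an illustration: `common_prefix` is a made-up helper,
not part of any package, and `version = 2` is passed on the assumption that
a current R is being used and should write the format discussed in this
thread.

```r
# Hypothetical helper: longest common leading run of several raw vectors.
common_prefix <- function(raws) {
  n <- min(lengths(raws))
  if (n == 0L) return(raw(0))
  # One column per raw vector, truncated to the shortest length
  m <- vapply(raws, function(r) as.integer(r[seq_len(n)]), integer(n))
  m <- matrix(m, nrow = n)
  # TRUE at each position where all vectors hold the same byte
  same <- apply(m, 1L, function(row) all(row == row[1L]))
  # Number of leading positions on which all vectors agree
  k <- if (all(same)) n else which(!same)[1L] - 1L
  raws[[1L]][seq_len(k)]
}

# A few objects of different types, so the type code after the header differs
objects <- list(NA, 1, 1:10, letters, list(a = 1, b = "x"))
raws <- lapply(objects, serialize, connection = NULL, ascii = TRUE,
               version = 2)
hdr <- common_prefix(raws)
cat("Length:", length(hdr), "\n")
cat(rawToChar(hdr))
```

An empirical prefix like this depends on the objects chosen: with too few
object types, leading bytes of the data section can be mistaken for header.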
>
> An official, and definitive, statement from the _R-core_ team has been
> available to you all along at
>
> https://svn.r-project.org/R/trunk/src/main/serialize.c

There's also a bit of info on this in Section 1.7 of the "R Internals"
Manual.

Paul

> My unofficial and non-definitive interpretation of that statement is
> that there is a header of four items,
>
>   A format code 'A' or 'X' ('B' also possible in older formats)
>   Version number of the format
>   Packed integer containing the R version that did the serializing
>   Packed integer containing the oldest R version that can read the format
>
> You can see this if you look at the ascii version as text:
>
>     > serialize(1, stdout(), ascii=TRUE)
>     A
>     2
>     132097
>     131840
>     14
>     1
>     1
>     NULL
>     > serialize(as.integer(1), stdout(), ascii=TRUE)
>     A
>     2
>     132097
>     131840
>     13
>     1
>     1
>     NULL
>
> In the non-ascii 'X' (as in xdr) format this will constitute 14 bytes
> (the two-byte format code plus three four-byte integers).
> In ascii format I believe it is currently 18 bytes but this could
> change with the version number of R -- I'd have to read the official
> and definitive statement to see how the integer packing is done and
> work out whether that could change the number of bytes. The number of
> bytes would also change if we reached format version 10, but something
> about the format would also change of course. A safer way to look at
> the header in the ascii version is as the first four lines.
>
> Best,
>
> luke
>
>> Thanks
>>
>> /Henrik
>>
>> On 3/7/07, Henrik Bengtsson <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> I noticed that serialize() gives different results depending on R
>>> version, which has implications for the digest() function in the digest
>>> package. Note, it does give the same output across platforms. I know
>>> that serialize() is under development, but is this expected, e.g. is
>>> there some kind of header in the result that specifies "who" generated
>>> the stream, and if so, exactly which bytes are they?
>>>
>>> SETUP:
>>>
>>> R versions:
>>> A) R v2.4.0 (2006-10-03)
>>> B) R v2.4.1pat (2007-01-13 r40470)
>>> C) R v2.5.0dev (2006-12-12 r40167)
>>>
>>> This is on WinXP and I start R with Rterm --vanilla.
>>>
>>> Example: Identical serialize() calls using the different R versions.
>>>
>>> > raw <- serialize(1, connection=NULL, ascii=TRUE)
>>> > print(raw)
>>> gives:
>>>
>>> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34
>>> 0a 31 0a 31 0a
>>> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34
>>> 0a 31 0a 31 0a
>>> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34
>>> 0a 31 0a 31 0a
>>>
>>> Note the difference in raw bytes 8 to 10, i.e.
>>>
>>> > raw[7:11]
>>> (A): [1] 32 30 39 36 0a
>>> (B): [1] 32 30 39 37 0a
>>> (C): [1] 32 33 35 32 0a
>>>
>>> Do bytes 8, 9 and 10 in the raw vector somehow contain information
>>> about the R version or similar? The following poor man's test says
>>> that is the only difference:
>>>
>>> On all R versions, the following gives identical results:
>>>
>>> > raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
>>> > raw <- as.integer(raw[-c(8:10)])
>>> > sum(raw)
>>> [1] 2147884
>>> > sum(log(raw))
>>> [1] 177201.2
>>>
>>> If it is true that there is an R-version-specific header in serialized
>>> objects, then the digest() function should exclude such a header in
>>> order to produce consistent results across R versions, because now
>>> digest(1) gives different results.
>>>
>>> Thank you
>>>
>>> Henrik
>>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>

-- 
Dr Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019
Auckland
New Zealand
64 9 3737599 x85392
[EMAIL PROTECTED]
http://www.stat.auckland.ac.nz/~paul/
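
As a rough illustration of the four-item header described in the thread,
the sketch below splits an ASCII serialization at its fourth newline and
unpacks the two version integers. The helper names `decode_packed_version`
and `split_serialize_header` are made up for this example. The packing
assumed here (major * 65536 + minor * 256 + patch) is inferred from the
values quoted in the thread (132096 for R 2.4.0, 132097 for 2.4.1, 132352
for 2.5.0) and should be checked against serialize.c. `version = 2` is
passed on the assumption that a current R is used; newer serialization
formats may carry additional header items.

```r
# Hypothetical helper: unpack an integer assumed to be packed as
# major * 65536 + minor * 256 + patch.
decode_packed_version <- function(packed) {
  packed <- as.integer(packed)
  major <- packed %/% 65536L
  minor <- (packed %% 65536L) %/% 256L
  patch <- packed %% 256L
  paste(major, minor, patch, sep = ".")
}

# Hypothetical helper: split an ASCII serialization into its header fields
# and the remaining data bytes. Assumes the header is exactly the first
# four newline-terminated lines, as suggested in the thread.
split_serialize_header <- function(object) {
  raw <- serialize(object, connection = NULL, ascii = TRUE, version = 2)
  newlines <- which(raw == as.raw(0x0a))
  end <- newlines[4]
  header <- strsplit(rawToChar(raw[seq_len(end)]), "\n", fixed = TRUE)[[1]]
  list(
    format         = header[1],
    format_version = as.integer(header[2]),
    written_by     = decode_packed_version(header[3]),
    min_reader     = decode_packed_version(header[4]),
    data           = raw[-seq_len(end)]
  )
}

decode_packed_version(132097)  # "2.4.1"
decode_packed_version(131840)  # "2.3.0"

parts <- split_serialize_header(1)
str(parts[c("format", "format_version", "written_by", "min_reader")])
```

Hashing only the `data` element would be one way to make a digest
independent of the R version that wrote the stream, under the same
assumptions about the header layout.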