On 3/8/07, Luke Tierney <[EMAIL PROTECTED]> wrote:
> On Fri, 9 Mar 2007, Paul Murrell wrote:
>
> > Hi
> >
> >
> > Luke Tierney wrote:
> >> On Wed, 7 Mar 2007, Henrik Bengtsson wrote:
> >>
> >>> To follow up, I went ahead and generated "random" objects to scan for a
> >>> common header for a given R version, and it seems that at most
> >>> the first 18 bytes are non-data specific, which could be the length of
> >>> the serialization header.
> >>>
> >>> Here is my code for this:
> >>>
> >>> scanSerialize <- function(object, hdr=NULL, ...) {
> >>>   # Serialize object
> >>>   raw <- serialize(object, connection=NULL, ascii=TRUE);
> >>>
> >>>   # First run?
> >>>   if (is.null(hdr))
> >>>     return(raw);
> >>>
> >>>   # Find differences between current longest header and new raw vector
> >>>   n <- length(hdr);
> >>>   diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
> >>>
> >>>   # No differences?
> >>>   if (!any(diffs))
> >>>     return(hdr);
> >>>
> >>>   # Position of first difference
> >>>   idx <- which(diffs)[1];
> >>>
> >>>   # Keep common header
> >>>   hdr <- hdr[seq_len(idx-1)];
> >>>
> >>>   hdr;
> >>> };
> >>>
> >>> # Serialize a first "random" object
> >>> hdr <- scanSerialize(NA);
> >>> for (kk in 1:100)
> >>>   hdr <- scanSerialize(kk, hdr=hdr);
> >>> for (kk in 1:100) {
> >>>   x <- sample(letters, size=sample(100), replace=TRUE);
> >>>   hdr <- scanSerialize(x, hdr=hdr);
> >>> }
> >>> for (kk in 1:100) {
> >>>   hdr <- scanSerialize(kk, hdr=hdr);
> >>>   hdr <- scanSerialize(hdr, hdr=hdr);
> >>> }
> >>>
> >>> cat("Length:", length(hdr), "\n");
> >>> print(hdr);
> >>> print(rawToChar(hdr));
> >>>
> >>> On R v2.5.0 devel, this gives:
> >>> Length: 18
> >>> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
> >>> [1] "A\n2\n132352\n131840\n"
> >>>
> >>> However, it would still be good to get an "official" statement from
> >>> one in the R-code team about the serialization header and where the
> >>> data section starts.
> >>> Again, I want to cut out as much as possible for
> >>> consistency between R versions without losing data-dependent bytes.
> >>
> >> An official, and definitive, statement from the _R-core_ team has been
> >> available to you all along at
> >>
> >> https://svn.r-project.org/R/trunk/src/main/serialize.c
> >
> >
> > There's also a bit of info on this in Section 1.7 of the "R Internals"
> > Manual.
> >
> > Paul
>
> Thanks -- I'd forgotten about that.  Looking at that shows that my
> unofficial and non-definitive interpretation was not quite right for
> the binary case -- the header there is 14 bytes (I forgot that there
> is a \n after the X even in the binary case).
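
For reference, here is a minimal sketch of dropping the header based on the layout discussed in this thread: everything up to and including the 4th newline in the ASCII case, and the first 14 bytes in the XDR binary case. The name stripSerializeHeader is hypothetical (it is not part of base R or the digest package). The sketch passes version = 2 to serialize() in current R, on the assumption that other format versions may lay the header out differently.

```r
# Hypothetical helper, not part of base R or the digest package.
# Drops the four-item header from a format-version-2 serialization.
stripSerializeHeader <- function(bytes) {
  fmt <- rawToChar(bytes[1])
  if (fmt == "A") {
    # ASCII: the header is the first four lines
    nl <- which(bytes == as.raw(0x0a))
    stopifnot(length(nl) >= 4L)
    bytes[-seq_len(nl[4L])]
  } else if (fmt == "X") {
    # XDR binary: "X\n" plus three 4-byte integers = 14 bytes
    stopifnot(length(bytes) > 14L)
    bytes[-seq_len(14L)]
  } else {
    stop("unrecognized format code: ", fmt)
  }
}

asc <- serialize(1, connection = NULL, ascii = TRUE, version = 2)
bin <- serialize(1, connection = NULL, ascii = FALSE, version = 2)

cat(rawToChar(stripSerializeHeader(asc)))   # data section as text
print(stripSerializeHeader(bin))            # data section as raw bytes
```

Locating the newlines rather than hard-coding 18 bytes keeps the ASCII branch working if the packed version numbers ever change their number of digits.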
Luke and Paul, thank you for this.  Searching for the 4th newline
seems to be the most robust thing to do in the ASCII case.

/Henrik

>
> Best,
>
> luke
>
> >
> >
> >> My unofficial and non-definitive interpretation of that statement is
> >> that there is a header of four items,
> >>
> >>    A format code 'A' or 'X' ('B' also possible in older formats)
> >>    version number of the format
> >>    Packed integer containing the R version that did the serializing
> >>    Packed integer containing the oldest R version that can read the
> >>      format
> >>
> >> You can see this if you look at the ascii version as text:
> >>
> >> > serialize(1, stdout(), ascii=TRUE)
> >> A
> >> 2
> >> 132097
> >> 131840
> >> 14
> >> 1
> >> 1
> >> NULL
> >> > serialize(as.integer(1), stdout(), ascii=TRUE)
> >> A
> >> 2
> >> 132097
> >> 131840
> >> 13
> >> 1
> >> 1
> >> NULL
> >>
> >> In the non-ascii 'X' (as in xdr) format this will constitute 13 bytes.
> >> In ascii format I believe it is currently 18 bytes but this could
> >> change with the version number of R -- I'd have to read the official
> >> and definitive statement to see how the integer packing is done and
> >> work out whether that could change the number of bytes.  The number of
> >> bytes would also change if we reached format version 10, but something
> >> about the format would also change of course.  A safer way to look at
> >> the header in the ascii version is as the first four lines.
> >>
> >> Best,
> >>
> >> luke
> >>
> >>> Thanks
> >>>
> >>> /Henrik
> >>>
> >>> On 3/7/07, Henrik Bengtsson <[EMAIL PROTECTED]> wrote:
> >>>> Hi,
> >>>>
> >>>> I noticed that serialize() gives different results depending on R
> >>>> version, which has implications for the digest() function in the digest
> >>>> package.  Note, it does give the same output across platforms.  I know
> >>>> that serialize() is under development, but is this expected, e.g.
> >>>> is there some kind of header in the result that specifies "who"
> >>>> generated the stream, and if so, exactly what bytes are they?
> >>>>
> >>>> SETUP:
> >>>>
> >>>> R versions:
> >>>> A) R v2.4.0 (2006-10-03)
> >>>> B) R v2.4.1pat (2007-01-13 r40470)
> >>>> C) R v2.5.0dev (2006-12-12 r40167)
> >>>>
> >>>> This is on WinXP and I start R with Rterm --vanilla.
> >>>>
> >>>> Example: Identical serialize() calls using the different R versions.
> >>>>
> >>>>> raw <- serialize(1, connection=NULL, ascii=TRUE)
> >>>>> print(raw)
> >>>> gives:
> >>>>
> >>>> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
> >>>> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
> >>>> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
> >>>>
> >>>> Note the difference in raw bytes 8 to 10, i.e.
> >>>>
> >>>>> raw[7:11]
> >>>> (A): [1] 32 30 39 36 0a
> >>>> (B): [1] 32 30 39 37 0a
> >>>> (C): [1] 32 33 35 32 0a
> >>>>
> >>>> Do bytes 8, 9 and 10 in the raw vector somehow contain information
> >>>> about the R version or similar?  The following poor man's test says
> >>>> that is the only difference:
> >>>>
> >>>> On all R versions, the following gives identical results:
> >>>>
> >>>>> raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
> >>>>> raw <- as.integer(raw[-c(8:10)])
> >>>>> sum(raw)
> >>>> [1] 2147884
> >>>>> sum(log(raw))
> >>>> [1] 177201.2
> >>>>
> >>>> If it is true that there is an R version-specific header in serialized
> >>>> objects, then the digest() function should exclude such a header in
> >>>> order to produce consistent results across R versions, because now
> >>>> digest(1) gives different results.
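
As an inline note: the bytes that differ are decimal digits of the third header line. Assuming the packing used by the R_Version macro in Rversion.h (major * 65536 + minor * 256 + patch), the header values quoted in this thread can be unpacked with the short sketch below. The name decodePackedVersion is hypothetical, not an existing R function.

```r
# Hypothetical helper: unpack an integer packed as
# major * 65536 + minor * 256 + patch (assumed R_Version layout).
decodePackedVersion <- function(v) {
  v <- as.integer(v)
  sprintf("%d.%d.%d", v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L)
}

# Third-line values from (A), (B), (C), then the shared fourth-line value
decodePackedVersion(c(132096, 132097, 132352, 131840))
# [1] "2.4.0" "2.4.1" "2.5.0" "2.3.0"
```

Under that assumption, the third line identifies the R version that wrote the stream and the fourth the oldest version able to read it, which matches the three setups listed above.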
> >>>>
> >>>> Thank you
> >>>>
> >>>> Henrik
> >>>>
> >>> ______________________________________________
> >>> R-devel@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>
> >>
> >
> >
>
> --
> Luke Tierney
> Chair, Statistics and Actuarial Science
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:      [EMAIL PROTECTED]
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>