On Fri, 9 Mar 2007, Paul Murrell wrote:

> Hi
>
>
> Luke Tierney wrote:
>> On Wed, 7 Mar 2007, Henrik Bengtsson wrote:
>>
>>> To follow up, I went ahead and generated "random" objects to scan for a
>>> common header for a given R version, and it seems that at most
>>> the first 18 bytes are non-data specific, which could be the length of
>>> the serialization header.
>>>
>>> Here is my code for this:
>>>
>>> scanSerialize <- function(object, hdr=NULL, ...) {
>>>   # Serialize object
>>>   raw <- serialize(object, connection=NULL, ascii=TRUE);
>>>
>>>   # First run?
>>>   if (is.null(hdr))
>>>     return(raw);
>>>
>>>   # Find differences between current longest header and new raw vector
>>>   n <- length(hdr);
>>>   diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
>>>
>>>   # No differences?
>>>   if (!any(diffs))
>>>     return(hdr);
>>>
>>>   # Position of first difference
>>>   idx <- which(diffs)[1];
>>>
>>>   # Keep common header
>>>   hdr <- hdr[seq_len(idx-1)];
>>>
>>>   hdr;
>>> };
>>>
>>> # Serialize a first "random" object
>>> hdr <- scanSerialize(NA);
>>> for (kk in 1:100)
>>>   hdr <- scanSerialize(kk, hdr=hdr);
>>> for (kk in 1:100) {
>>>   # Character vector of random length between 1 and 100
>>>   x <- sample(letters, size=sample(100, size=1), replace=TRUE);
>>>   hdr <- scanSerialize(x, hdr=hdr);
>>> }
>>> for (kk in 1:100) {
>>>   hdr <- scanSerialize(kk, hdr=hdr);
>>>   hdr <- scanSerialize(hdr, hdr=hdr);
>>> }
>>>
>>> cat("Length:", length(hdr), "\n");
>>> print(hdr);
>>> print(rawToChar(hdr));
>>>
>>> On R v2.5.0 devel, this gives:
>>> Length: 18
>>>  [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
>>> [1] "A\n2\n132352\n131840\n"
>>>
>>> However, it would still be good to get an "official" statement from
>>> one in the R-code team about the serialization header and where the
>>> data section starts.  Again, I want to cut out as much as possible for
>>> consistency between R versions without losing data-dependent bytes.
>>
>> An official, and definitive, statement from the _R-core_ team has been
>> available to you all along at
>>
>> https://svn.r-project.org/R/trunk/src/main/serialize.c
>
>
> There's also a bit of info on this in Section 1.7 of the "R Internals"
> Manual.
>
> Paul

Thanks -- I'd forgotten about that.  Looking at that shows that my
unofficial and non-definitive interpretation was not quite right for
the binary case -- the header there is 14 bytes (I forgot that there is
a \n after the X even in the binary case).

Best,

luke

>
>
>> My unofficial and non-definitive interpretation of that statement is
>> that there is a header of four items,
>>
>>     A format code 'A' or 'X' ('B' also possible in older formats)
>>     version number of the format
>>     Packed integer containing the R version that did the serializing
>>     Packed integer containing the oldest R version that can read the format
>>
>> You can see this if you look at the ascii version as text:
>>
>>     > serialize(1, stdout(), ascii=TRUE)
>>     A
>>     2
>>     132097
>>     131840
>>     14
>>     1
>>     1
>>     NULL
>>     > serialize(as.integer(1), stdout(), ascii=TRUE)
>>     A
>>     2
>>     132097
>>     131840
>>     13
>>     1
>>     1
>>     NULL
>>
>> In the non-ascii 'X' (as in xdr) format this will constitute 13 bytes.
>> In ascii format I believe it is currently 18 bytes but this could
>> change with the version number of R -- I'd have to read the official
>> and definitive statement to see how the integer packing is done and
>> work out whether that could change the number of bytes.  The number of
>> bytes would also change if we reached format version 10, but something
>> about the format would also change of course.  A safer way to look at
>> the header in the ascii version is as the first four lines.
>>
>> Best,
>>
>> luke
>>
>>> Thanks
>>>
>>> /Henrik
>>>
>>> On 3/7/07, Henrik Bengtsson <[EMAIL PROTECTED]> wrote:
>>>> Hi,
>>>>
>>>> I noticed that serialize() gives different results depending on R
>>>> version, which has implications for the digest() function in the digest
>>>> package.  Note, it does give the same output across platforms.  I know
>>>> that serialize() is under development, but is this expected, e.g. is
>>>> there some kind of header in the result that specifies "who" generated
>>>> the stream, and if so, exactly what bytes are they?
>>>>
>>>> SETUP:
>>>>
>>>> R versions:
>>>> A) R v2.4.0 (2006-10-03)
>>>> B) R v2.4.1pat (2007-01-13 r40470)
>>>> C) R v2.5.0dev (2006-12-12 r40167)
>>>>
>>>> This is on WinXP and I start R with Rterm --vanilla.
>>>>
>>>> Example: Identical serialize() calls using the different R versions.
>>>>
>>>> > raw <- serialize(1, connection=NULL, ascii=TRUE)
>>>> > print(raw)
>>>> gives:
>>>>
>>>> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
>>>> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
>>>> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
>>>>
>>>> Note the difference in raw bytes 8 to 10, i.e.
>>>>
>>>> > raw[7:11]
>>>> (A): [1] 32 30 39 36 0a
>>>> (B): [1] 32 30 39 37 0a
>>>> (C): [1] 32 33 35 32 0a
>>>>
>>>> Do bytes 8, 9 and 10 in the raw vector somehow contain information
>>>> about the R version or similar?  The following poor man's test says
>>>> that is the only difference:
>>>>
>>>> On all R versions, the following gives identical results:
>>>>
>>>> > raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
>>>> > raw <- as.integer(raw[-c(8:10)])
>>>> > sum(raw)
>>>> [1] 2147884
>>>> > sum(log(raw))
>>>> [1] 177201.2
>>>>
>>>> If it is true that there is an R-version-specific header in serialized
>>>> objects, then the digest() function should exclude that header in
>>>> order to produce consistent results across R versions, because now
>>>> digest(1) gives different results.
>>>>
>>>> Thank you
>>>>
>>>> Henrik
>>>>
>>> ______________________________________________
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>
>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:      [EMAIL PROTECTED]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
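
For what it's worth, here is a minimal and unofficial sketch of the "first four lines" view of the ascii header described above, together with a decoder for the two packed version integers. The function names decode_r_version() and split_ascii_header() are hypothetical helpers made up for this illustration; they are not part of R or of the digest package. The packing is assumed to be major * 65536 + minor * 256 + patch, which is consistent with every value quoted in this thread (132096, 132097, 132352 and 131840). The two test vectors are the byte strings reported above for serialize(1, NULL, ascii=TRUE) under versions (A) and (C).

```r
# Unpack a version integer from the serialization header.
# Assumption: packed = major * 65536 + minor * 256 + patch.
decode_r_version <- function(packed) {
  packed <- as.integer(packed)
  major <- packed %/% 65536L
  minor <- (packed %% 65536L) %/% 256L
  patch <- packed %% 256L
  paste(major, minor, patch, sep = ".")
}

# Split an ascii-format serialization into header fields and data bytes.
# The header is taken to be the first n_fields newline-terminated items;
# n_fields is a parameter in case a different format version carries a
# different number of header items.
split_ascii_header <- function(raw_vec, n_fields = 4L) {
  stopifnot(is.raw(raw_vec))
  newlines <- which(raw_vec == as.raw(0x0a))
  stopifnot(length(newlines) >= n_fields)
  end <- newlines[n_fields]
  fields <- strsplit(rawToChar(raw_vec[seq_len(end)]), "\n", fixed = TRUE)[[1]]
  list(fields = fields, body = raw_vec[-seq_len(end)])
}

# Byte strings as reported in this thread for serialize(1, NULL, ascii=TRUE)
ser_a <- charToRaw("A\n2\n132096\n131840\n14\n1\n1\n")  # (A) R v2.4.0
ser_c <- charToRaw("A\n2\n132352\n131840\n14\n1\n1\n")  # (C) R v2.5.0dev

ha <- split_ascii_header(ser_a)
hc <- split_ascii_header(ser_c)

print(decode_r_version(ha$fields[3]))  # "2.4.0", the writing R version
print(decode_r_version(hc$fields[3]))  # "2.5.0"
print(decode_r_version(ha$fields[4]))  # "2.3.0", the oldest R that can read it
print(identical(ha$body, hc$body))     # TRUE: the data bytes agree
```

Splitting on newlines rather than at a fixed byte offset avoids depending on the header being exactly 18 bytes, which would stop holding if either packed integer or the format version number changed its number of digits.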