On Sat, 2004-08-14 at 08:42, Tony Plate wrote:
> At Friday 08:41 PM 8/13/2004, Marc Schwartz wrote:
> >Part of that decision may depend upon how big the dataset is and what is
> >intended to be done with the ID's:
> >
> > > object.size(1011001001001)
> >[1] 36
> >
> > > object.size("1011001001001")
> >[1] 52
> >
> > > object.size(factor("1011001001001"))
> >[1] 244
> >
> >They will by default, as Andy indicates, be read and stored as doubles.
> >They are too large for integers, at least on my system:
> >
> > > .Machine$integer.max
> >[1] 2147483647
> >
> >Converting to a character might make sense, with only a minimal memory
> >penalty. However, using a factor results in a notable memory penalty, if
> >the attributes of a factor are not needed.
>
> That depends on how long the vectors are. The memory overhead for factors
> is per vector, with only 4 bytes used for each additional element (if the
> level already appears). The memory overhead for character data is per
> element -- there is no amortization for repeated values.
>
> > object.size(factor("1011001001001"))
> [1] 244
>
> > object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))
> [1] 308
>
> # bytes per element in factor, for length 4:
> > object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))/4
> [1] 77
>
> # bytes per element in factor, for length 1000:
> > object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250)))/1000
> [1] 4.292
>
> # bytes per element in character data, for length 1000:
> > object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250))))/1000
> [1] 20.028
>
> So, for long vectors with relatively few different values, storage as
> factors is far more memory efficient (this is because the character data
> is stored only once per level, and each element is stored as a 4-byte
> integer). (The above was done on Windows 2000.)
>
> -- Tony Plate
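(To see the tradeoff Tony describes at both extremes, here is a minimal
sketch along the same lines; the ID values are invented, and the exact byte
counts will differ by platform and R version, subject to the same
object.size() caveats I raise below.)

# Long vector, only a few distinct ID values: the level strings are stored
# once and each element is just an integer code, so the factor should come
# out well under the character vector per element (roughly 4 vs. 20 bytes
# per element in Tony's figures above).
ids <- rep(c("1011001001001", "111001001001",
             "001001001001", "011001001001"), 25000)
object.size(ids) / length(ids)           # per-element size as character
object.size(factor(ids)) / length(ids)   # per-element size as factor

# Long vector of all-unique IDs: every element is also a level, so the
# factor stores all of the strings anyway, plus the integer codes, and the
# advantage should largely disappear.
uids <- sprintf("1%012d", seq_len(100000))
object.size(uids) / length(uids)
object.size(factor(uids)) / length(uids)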
Good point, Tony. I was making the, perhaps incorrect, assumption that the
ID's were unique or relatively so. However, as it turns out, even that
assumption is relevant only to a certain extent with respect to how much
memory is required.

What is interesting (and presumably I need to do some more reading on how R
stores objects internally) is that the incremental amount of memory is not
consistent on a per-element basis for a given object, though there is a
pattern. It is also dependent upon the size of the new elements to be
added, as I note at the bottom.

All of this, of course, presumes that object.size() is giving a reasonable
approximation of the amount of memory actually allocated to an object, for
which the notes in ?object.size raise at least some doubt. This is a
critical assumption for the data below, which were collected on FC2 on a
P4.

For example:

> object.size("a")
[1] 44

> object.size(letters)
[1] 340

In the second case, as Tony has noted, the size of letters (a character
vector) is not 26 * 44.

Now note:

> object.size(c("a", "b"))
[1] 52
> object.size(c("a", "b", "c"))
[1] 68
> object.size(c("a", "b", "c", "d"))
[1] 76
> object.size(c("a", "b", "c", "d", "e"))
[1] 92

The incremental sizes alternate between 8 and 16.

Now for a factor:

> object.size(factor("a"))
[1] 236
> object.size(factor(c("a", "b")))
[1] 244
> object.size(factor(c("a", "b", "c")))
[1] 268
> object.size(factor(c("a", "b", "c", "d")))
[1] 276
> object.size(factor(c("a", "b", "c", "d", "e")))
[1] 300

The incremental sizes alternate between 8 and 24.

Using elements along the lines of Dan's:

> object.size("1000000000000")
[1] 52
> object.size(c("1000000000000", "1000000000001"))
[1] 68
> object.size(c("1000000000000", "1000000000001", "1000000000002"))
[1] 92
> object.size(c("1000000000000", "1000000000001", "1000000000002", "1000000000003"))
[1] 108
> object.size(c("1000000000000", "1000000000001", "1000000000002", "1000000000003", "1000000000004"))
[1] 132

The increments alternate between 16 and 24.

For factors:

> object.size(factor("1000000000000"))
[1] 244
> object.size(factor(c("1000000000000", "1000000000001")))
[1] 260
> object.size(factor(c("1000000000000", "1000000000001", "1000000000002")))
[1] 292
> object.size(factor(c("1000000000000", "1000000000001", "1000000000002", "1000000000003")))
[1] 308
> object.size(factor(c("1000000000000", "1000000000001", "1000000000002", "1000000000003", "1000000000004")))
[1] 340

The increments alternate between 16 and 32.

So, the incremental size seems to alternate as elements are added. The
behavior above would perhaps suggest that memory is allocated to objects in
a way that leaves room for pairs of elements to be added. When the second
element of the pair is added, only a minimal incremental amount of
additional memory (and presumably time) is required. However, when a
"third" element is added, additional memory is required because the object
needs to be adjusted in a more fundamental way to handle the new element.

There also appears to be some memory allocation "adjustment" at play here.
Note:

> object.size(factor("1000000000000"))
[1] 244

> object.size(factor("1000000000000", "a"))
[1] 236

In the second case, the amount of memory reported actually declines by 8
bytes. This suggests (to some extent consistent with my thoughts above)
that when the object is initially created, there is space for two elements
and that space is allocated based upon the size of the first element. When
the second element is added, the space required is adjusted based upon the
actual size of the second element.
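(For anyone who wants to tabulate these increments without typing each call
by hand, here is a minimal sketch; size_steps() is just a name I made up,
and the numbers it returns are of course subject to the same object.size()
and platform caveats as everything above.)

# Size of a character vector (or factor) as it grows one element at a time,
# reported as the increment contributed by each additional element.
size_steps <- function(elements, as_factor = FALSE) {
  sizes <- sapply(seq_along(elements), function(n) {
    x <- elements[seq_len(n)]
    if (as_factor) x <- factor(x)
    object.size(x)
  })
  diff(sizes)  # increment when the n-th element is added
}

size_steps(c("a", "b", "c", "d", "e"))                       # 8 16 8 16 above
size_steps(c("a", "b", "c", "d", "e"), as_factor = TRUE)     # 8 24 8 24 above
size_steps(sprintf("100000000000%d", 0:4))                   # 16 24 16 24 above
size_steps(sprintf("100000000000%d", 0:4), as_factor = TRUE) # 16 32 16 32 above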
Again, all of the above presumes that object.size() is reporting correct
information.

Thanks,

Marc