On 5/5/05, Ted Harding <[EMAIL PROTECTED]> wrote:
> On 05-May-05 Peter Dalgaard wrote:
> > [...]
> > Both systems are victims of the curse of the rectangular data set to
> > some extent. Prototypically, you record the sex of a rat along with
> > every single measurement on it, as if the rat could change sex at
> > millisecond resolution. This probably applies to all current
> > statistical systems, but there is some hope that R's more flexible
> > data structures can be leveraged to better handle multilevel data.
> > (Cue Probabilistic Relational Models a.m. Getoor et al., which Peter
> > Green brought up at the recent gR meeting.)
>
> I would agree with this hope. Indeed I was reminded of the issue
> by Alessandro Carletti's recent query about extracting features
> from the data at different marine sampling stations.
>
> My involvement goes back to the days (around 1980) when, with
> Jan Boëtius, I was examining Johannes Schmidt's data on eel larvae
> obtained during his Atlantic cruises to investigate the "spawning
> question" of the European eel (funded by the Carlsberg Foundation,
> Peter!).
>
> Each Cruise consisted of a series of Stations by a given Ship
> at different Geographic positions, at each of which a number of Hauls
> would be made in different Years and different Months on different
> Days at different Times of day, using different Equipments and at
> different Depths or ranges of Depth, and of different Durations,
> and at different Speeds, resulting in capture of none or several
> specimens each of which would be examined for length, numbers of
> myomeres (muscle segments), and other features, along with hydrographic
> measurements.
>
> This could have been embodied in a huge "rectangular table" with of
> course much repetition of all the information that remains constant
> for each specimen in a haul. The specimen-specific data consisted of
> only 2-4 items, while the "constant" data consisted of 12-15
> items. There were nearly 20,000 larvae, so the "rectangular table"
> could have occupied well over a Megabyte.
>
> The alternative is a "list" representation, like:
>
>   Investigation = list(Cruises)
>   Cruise = list(Ship,list(Stations))
>   Station = list(Position,list(Hauls))
>   Haul = list(Year,Month,Day,Time,Duration,(Equipment data),(Depths),
>               Speed,list(Specimens))
>   Specimen = list(Length,Myomeres,...)
>
> In the end, the "list-like" view was the one adopted (I was limited
> to CP/M BASIC in some 48K of free RAM, with 256KB floppies, in those
> days), though not fully formally programmed (some of the "list
> parsing" was done by hand, i.e. replacing one floppy with another),
> though the BASIC program did retain the previously read data
> for a given Station when reading in new Haul data, and the Haul
> data when reading in Specimen data.
>
> Later, when I began to study C, I realised that the language
> was well adapted to implementing such structures in a program,
> though by then following this up would have been motivated by
> curiosity rather than needing to get the job done (it already
> was done).
>
> Now, in R, I see that in principle such data representations
> are well integrated into the language, and I've been yet again
> tempted to look at the question!
>
> However, while representing the raw data in such a form is
> well supported by R, it seems to me that extracting data
> in a way adapted to different analyses requires users to
> create their own methods, using the list-access primitives.
>
> For example, to study the changes in the distribution of
> lengths of specimens in relation to Position and Date
> (which was one of the important issues in that investigation),
> I don't think there are any "list processing" functions
> available in R which, given the list-based structure described
> above, would allow a simple query of the form
>
>   means( Length , ~ Position:Date , data=Cruise )
>
> It's quite feasible to write one's own; but I think Peter's
> hope (expressed in excerpt above) looks like a first call
> for thinking about general methods for this sort of thing.
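As a rough illustration of the query Ted describes, here is a minimal sketch in present-day R. Everything in it is invented for the example: the toy values, the field names (Position, Date, Depth, Length, Myomeres, with Year/Month/Day collapsed into a single Date label) and the helper flatten_cruise() are assumptions, not Schmidt's data and not an existing R facility. The idea is to keep the data nested, copy the haul- and station-level fields down to specimen rows only when an analysis needs them, and then let aggregate() compute the means. The last lines use base R's rapply(), which may or may not match the Green Book definition; it selects leaves by class rather than by name, so the lengths are picked out afterwards from the names that unlisting produces.

```r
# Hypothetical toy data in the nested layout from the quoted message:
# Cruise -> Stations -> Hauls -> Specimens. All values are invented.
specimen <- function(len, myo) list(Length = len, Myomeres = myo)
haul <- function(date, depth, specimens) {
  list(Date = date, Depth = depth, Specimens = specimens)
}

cruise <- list(
  Ship = "ShipA",
  Stations = list(
    list(Position = "P1",
         Hauls = list(
           haul("day1", 50, list(specimen(10, 114), specimen(12, 115))),
           haul("day2", 75, list(specimen(20, 116)))
         )),
    list(Position = "P2",
         Hauls = list(
           haul("day1", 50, list()),  # a haul that caught nothing
           haul("day2", 25, list(specimen(30, 113), specimen(34, 115)))
         ))
  )
)

# Hypothetical helper: one row per specimen, with the "constant"
# station and haul fields repeated only in this temporary table.
flatten_cruise <- function(cruise) {
  rows <- list()
  for (st in cruise$Stations) {
    for (h in st$Hauls) {
      for (sp in h$Specimens) {
        rows[[length(rows) + 1]] <- data.frame(
          Position = st$Position, Date = h$Date,
          Length = sp$Length, Myomeres = sp$Myomeres,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, rows)
}

flat <- flatten_cruise(cruise)

# Roughly the query  means( Length , ~ Position:Date , data=Cruise )
length_means <- aggregate(Length ~ Position + Date, data = flat, FUN = mean)

# rapply() walks every leaf of the nested list. Here it keeps the
# numeric leaves; unlisting names them by their path, so the lengths
# can be selected by name afterwards.
leaves <- rapply(cruise, function(x) x, classes = "numeric", how = "unlist")
all_lengths <- leaves[grepl("Length$", names(leaves))]
```

Printing length_means shows one row per Position and Date combination that actually has specimens; the empty haul simply contributes nothing. The explicit loops are a design choice for readability, and a recursive traversal would be the more general route for deeper or irregular nesting.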
The Green Book (Chambers, "Programming with Data") defines a recursive
apply function, rapply, that provides a general means of traversing
that sort of structure.

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html