Hi -- It's my first post on this list; as a relatively new user with little knowledge of R internals, I am a bit intimidated by the depth of some of the discussions here, so please spare me if I say something incredibly silly.
I feel that someone at this point should mention Matthew Dowle's excellent data.table package (http://cran.r-project.org/web/packages/data.table/index.html), which seems to me to address many of the inefficiencies of data.frame. data.tables have no row names, and operations that only need data from one or two columns are (I believe) just as quick whether the total number of columns is 5 or 1000. This results in very quick operations (and, often, elegant code as well).

Regards
Timothee

On Mon, Jul 4, 2011 at 6:19 AM, ivo welch <ivo.we...@gmail.com> wrote:
> thank you, simon. this was very interesting indeed. I also now
> understand how far out of my depth I am here.
>
> fortunately, as an end user, obviously, *I* now know how to avoid the
> problem. I particularly like the as.list() transformation and back to
> as.data.frame() to speed things up without loss of (much)
> functionality.
>
> more broadly, I view the avoidance of individual access through the
> use of apply and vector operations as a mixed "IQ test" and "knowledge
> test" (which I often fail). However, even for the most clever, there
> are also situations where the KISS programming principle makes
> explicit loops still preferable. Personally, I would have preferred
> it if R had, in its standard "statistical data set" data structure,
> foregone the row names feature in exchange for retaining fast direct
> access. R could have reserved its current implementation "with row
> names but slow access" for a less common (possibly pseudo-inheriting)
> data structure.
>
> If end users commonly do iterations over a data frame, which I would
> guess to be the case, then the impression of R by (novice) end users
> could be greatly enhanced if the extreme penalties could be eliminated
> or at least flagged. For example, I wonder if modest special internal
> code could store data frames internally and transparently as lists of
> vectors UNTIL a row name is assigned to.
> Easier and uglier, a simple but specific warning message could be
> issued with a suggestion if there is an individual read/write into a
> data frame ("Warning: data frames are much slower than lists of
> vectors for individual element access").
>
> I would also suggest changing the "Introduction to R" 6.3 from "A
> data frame may for many purposes be regarded as a matrix with columns
> possibly of differing modes and attributes. It may be displayed in
> matrix form, and its rows and columns extracted using matrix indexing
> conventions." to "A data frame may for many purposes be regarded as a
> matrix with columns possibly of differing modes and attributes. It may
> be displayed in matrix form, and its rows and columns extracted using
> matrix indexing conventions. However, data frames can be much slower
> than matrices or even lists of vectors (which, like data frames, can
> contain different types of columns) when individual elements need to
> be accessed." Reading about it immediately upon introduction could
> flag the problem in a more visible manner.
>
> regards,
>
> /iaw
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
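
A footnote on the as.list() round trip mentioned in the quoted message: what follows is only a rough sketch of how I understand it, not code taken from the thread. The column names, the size n and the helper names fill_df and fill_list are all made up for illustration. It fills one column element by element, once directly in the data frame and once in a plain list that is converted back to a data frame at the end.

```r
# Sketch only: n, the column names and both helper names are hypothetical.
n  <- 1000
df <- data.frame(x = numeric(n), y = numeric(n))

# Element-wise writes straight into the data frame. Every assignment
# goes through the data frame replacement method `[<-.data.frame`.
fill_df <- function(df) {
  for (i in seq_len(nrow(df))) {
    df[i, "x"] <- i^2
  }
  df
}

# Same loop, but on a plain list of column vectors; the data frame is
# rebuilt only once, after the loop has finished.
fill_list <- function(df) {
  cols <- as.list(df)
  for (i in seq_along(cols$x)) {
    cols$x[i] <- i^2
  }
  as.data.frame(cols)
}

slow <- fill_df(df)
fast <- fill_list(df)

# Both routes should give the same data frame.
stopifnot(isTRUE(all.equal(slow, fast)))
```

Wrapping the two calls in system.time() should show the gap on your own machine; I have left timings out because they depend on n and on the R version.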