I have a dataset of a few hundred thousand rows read from a database (via dbReadTable). The data frame looks like:
> str(measures)
`data.frame':   609363 obs. of  5 variables:
 $ vih.id   : int 1 2 3 4 5 6 7 8 9 10 ...
 $ vi.id    : int 1 2 3 4 5 6 7 8 9 10 ...
 $ vih.value: chr "0" "1989" "0" "N/A" ...
 $ vih.date : chr "20040226012314" "20040226012315" "20040226012315" "20040226012315" ...
 $ vih.run.n: int 1 1 1 1 1 1 1 1 1 1 ...

I'm reshaping it into wide format, like this:

> str(better)
`data.frame':   132 obs. of  6311 variables:
 $ vih.run.n  : int 1 2 4 5 6 7 8 9 10 11 ...
 $ vih.value.1: chr "0" "0" "0" "0" ...
 $ vih.value.2: chr "1989" "1989" "1989" "1989" ...
 $ vih.value.3: chr "0" "0" "0" "0" ...
 $ vih.value.4: chr "N/A" "N/A" "N/A" "N/A" ...
 $ vih.value.5: chr "3163979" "3163979" "3163979" "3163979" ...
 $ vih.value.6: chr "5500073" "5500073" "5500073" "5500073" ...
(etc., etc.)

This takes about 4-8 hours to accomplish. Should I a) build the wide format row by row as I fetch the data from the DB instead of using dbReadTable, or b) try to tune something in R? (I'm trying it now with R --min-vsize=600M --min-nsize=6M, although it doesn't seem any faster so far; I won't know for a while.)

(Using home-compiled R 1.8.1 on Mac OS X 10.3.2, under emacs/ESS; my R 1.8.1 on Solaris 2.8 has also been churning for a few hours on a split of the data that is 630 variables by 1000 obs.)

--Chris
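For what it's worth, here is a minimal sketch of doing the long-to-wide step with base R's reshape() in one pass rather than filling the wide frame piecemeal. It assumes only the column names shown in the str() output above; it is untested on data of this size, so treat it as a starting point:

## treat the "N/A" strings as missing and make the values numeric first;
## carrying 600k character strings through the reshape is much heavier
measures$vih.value[measures$vih.value == "N/A"] <- NA
measures$vih.value <- as.numeric(measures$vih.value)

## one vih.value.<vi.id> column per variable id, one row per run
better <- reshape(measures[, c("vih.run.n", "vi.id", "vih.value")],
                  idvar     = "vih.run.n",
                  timevar   = "vi.id",
                  direction = "wide")

As for a), dbSendQuery() plus fetch() with an n argument would let you pull the table in chunks instead of one dbReadTable() call, but from your description the hours are going into the reshape rather than the read, so that seems the more promising place to tune.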
