Hello List,

I have been agonizing over this for days; any reply would be greatly appreciated!

Situation:

My original dataset is a .csv file (about 2M records) with 4 variables:

  job_id         primary key; not used for analysis, only for joining tables
  sector_id      categorical variable for 19 industry sectors
  sqft           continuous variable for square footage
  building_type  categorical variable for 2 building types

Some values of sqft were entered wrong, so I'd like to set sqft < 1 to NA and then use aregImpute() to impute those NAs.

Problem:

The original dataset (.csv format) is too large. I can read it into R, but I cannot get aregImpute() to run, even with the memory limit set to 3G (yes, I did the switch in Windows to reach 3G rather than 2G).

Goal:

Find a way to slim down my dataset so that aregImpute() will run.

What I did:

I searched the archive and found someone saying that, since R tends to inflate memory, it is a good idea to first read the original dataset into R, then save it as a more compact binary file using save(), and then reload that binary file into R using load(). This is supposed to reduce the memory allocation.

HOWEVER, after I saved my original dataset to a compact binary file using save() and used load("filename.Rdata") to reload it into R, I could not figure out how to retrieve my variables! R shows that the new dataset is not a list, nor a matrix, nor a data frame, but just a character vector of length 1, and attach() fails on it.

I generated a 1K-row subset of my original dataset to illustrate the problem. Does anyone know how to get my four variables back from this "compact binary" dataset? What did I do wrong?

  > data <- read.table (file.choose(),header=T,sep=",")
  > summary(data)
       job_id         sector_id           sqft        building_type
   Min.   :   1.0   Min.   : 6.000   Min.   :  0.00   Min.   :1.000
   1st Qu.: 250.8   1st Qu.: 6.000   1st Qu.:  3.00   1st Qu.:2.000
   Median : 500.5   Median :11.000   Median :  4.00   Median :2.000
   Mean   : 500.5   Mean   : 9.455   Mean   : 12.49   Mean   :1.996
   3rd Qu.: 750.3   3rd Qu.:11.000   3rd Qu.:  4.00   3rd Qu.:2.000
   Max.   :1000.0   Max.   :12.000   Max.   :192.00   Max.   :2.000
  >
  > attach(data)
  > sqft[sqft<1] <- NA
  > sector.f <- as.factor(sector_id)
  > building_type.f <- as.factor (building_type)
  > d <- data.frame(job_id,sector.f,sqft, building_type.f)
  > summary (d)
       job_id       sector.f      sqft        building_type.f
   Min.   :   1.0   6 :340   Min.   :  3.00   1:  4
   1st Qu.: 250.8   11:505   1st Qu.:  4.00   2:996
   Median : 500.5   12:155   Median :  4.00
   Mean   : 500.5            Mean   : 14.16
   3rd Qu.: 750.3            3rd Qu.: 17.00
   Max.   :1000.0            Max.   :192.00
                             NA's   :118.00
  > save (d, file="compact_d.Rdata", ascii=FALSE)
  >
  > newdata <- load ("compact_d.Rdata")
  >
  > summary(newdata)
     Length     Class      Mode
          1 character character
  > attach(newdata)
  Error in attach(newdata) : file 'd' not found
  > is.data.frame (newdata)
  [1] FALSE
  > is.list (newdata)
  [1] FALSE
  > is.matrix (newdata)
  [1] FALSE
  >

By the way, I also tried just saving (to compact binary) and reloading, without the NA step (I could do the NA part in SQL anyhow). However, I still got stuck at the same spot:

  > data <- read.table (file.choose(),header=T,sep=",")
  > summary(data)
       job_id         sector_id           sqft        building_type
   Min.   :   1.0   Min.   : 6.000   Min.   :  0.00   Min.   :1.000
   1st Qu.: 250.8   1st Qu.: 6.000   1st Qu.:  3.00   1st Qu.:2.000
   Median : 500.5   Median :11.000   Median :  4.00   Median :2.000
   Mean   : 500.5   Mean   : 9.455   Mean   : 12.49   Mean   :1.996
   3rd Qu.: 750.3   3rd Qu.:11.000   3rd Qu.:  4.00   3rd Qu.:2.000
   Max.   :1000.0   Max.   :12.000   Max.   :192.00   Max.   :2.000
  > save (data, file="compact_data.Rdata", ascii=FALSE)
  > newdata <- load ("compact_data.Rdata")
  > summary(newdata)
     Length     Class      Mode
          1 character character
  > attach(newdata)
  Error: restore file may be empty -- no data loaded
  In addition: Warning message:
  file 'data' has magic number ''
     Use of save versions prior to 2 is deprecated
  > is.data.frame (newdata)
  [1] FALSE
  > is.list (newdata)
  [1] FALSE
  > is.matrix (newdata)
  [1] FALSE
  >
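
For reference, here is a minimal sketch of what load() does in the session above. It restores the saved objects into the workspace under their original names and returns only a character vector of those names, which is why newdata ends up as a length-1 character value. The toy values, the temporary file path, and the name loaded_names are made up for illustration; they are not from the real dataset.

```r
# Tiny stand-in for the real data frame (hypothetical values)
d <- data.frame(job_id          = 1:5,
                sector.f        = factor(c(6, 11, 11, 12, 6)),
                sqft            = c(3, NA, 4, 192, 17),
                building_type.f = factor(c(2, 2, 1, 2, 2)))

# Save to a temporary file so the sketch does not touch the working directory
rdata_file <- file.path(tempdir(), "compact_d.Rdata")
save(d, file = rdata_file)

# Remove the original, then reload it from disk
rm(d)
loaded_names <- load(rdata_file)  # side effect: `d` is recreated in the workspace

# load() returns the NAMES of the restored objects, not the objects themselves
print(loaded_names)
print(is.data.frame(d))

# To bind the restored object to a different name, look it up by that name
newdata <- get(loaded_names[1])
print(names(newdata))
```

Since attach() treats a character string as the name of a saved file, this would also be consistent with the "file 'd' not found" message shown in the transcript.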