> Provide a 'str' and 'object.size' of the object so that we can see what
> you are working with. My rule of thumb is that no single object should
> take more than 25-30% of memory, since copies may be made. So the reason
> things are taking 20 minutes may be that you are paging. It is always
> good to break the problem into pieces to see what is happening. Read in
> only 25% of the data and time it; then 50%, and so on. In any
> performance-related problem you need to determine where the "knee of the
> curve" is. Never undertake processing the large data file at once; start
> with some pieces and work up so that you know what to expect.
>
> On Wed, Aug 18, 2010 at 9:46 PM, Matt Cooper <mattcst...@gmail.com> wrote:
> > 2) My specific problem with this dataset.
> >
> > I am essentially trying to convert a date and add it to a data frame. I
> > imagine any 'data manipulation on a column within a data frame into a
> > new column' will present the same issue, be it as.Date or anything else.
> >
> > I have a dataset, size
> >
> > > dim(morbidity)
> > [1] 1775683     264
> >
> > This was read in from a STATA .dta file. The dates have come in as the
> > number of ms from 1960, so I have the following to convert these to
> > usable dates.
> >
> > as.Date(morbidity$adm_date / (100*10*60*60*24), origin="1960-01-01")
> >
> > When I store this as a vector it is near instant, <5 seconds:
> > test <- as.Date(etc)
> > When I place it over itself it takes ~20 minutes:
> > morbidity$adm_date <- as.Date(etc)
> > When I place the vector over it (so no computation involved), or place
> > it as a new column, it still takes ~20 minutes:
> > morbidity$adm_date <- test
> > morbidity$new_col <- test
> > When I tried a cbind to add it that way it took >20 minutes:
> > new_morb <- cbind(morbidity,test)
> >
> > Has anyone done something similar or know of a different command that
> > should work faster? I can't get my head around what R is doing. If it
> > can create the vector instantly, then the computation is quite simple;
> > I don't understand why adding it as a column to a data frame can then
> > take that long.
> >
> > 64-bit R on Mac OS X, 2.4 GHz dual core, 8 GB RAM, so more than enough
> > resources.
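
For anyone following along, the "time it in pieces" advice quoted above could be sketched roughly like this. It is only an illustration on synthetic data: the data frame `df`, the column `adm_ms`, and the row count are made-up stand-ins, not the real morbidity file, and the timings will differ on every machine.

```r
# Synthetic stand-in for the real data; names and sizes are assumptions.
set.seed(1)
n <- 200000
df <- data.frame(id = seq_len(n),
                 adm_ms = runif(n, 0, 1.5e12))  # ms since 1960-01-01

ms_per_day <- 1000 * 60 * 60 * 24

# Time the column assignment on 25%, 50%, 75% and 100% of the rows.
fractions <- c(0.25, 0.50, 0.75, 1.00)
timings <- sapply(fractions, function(f) {
  part <- df[seq_len(floor(n * f)), , drop = FALSE]
  system.time(
    part$adm_date <- as.Date(part$adm_ms / ms_per_day, origin = "1960-01-01")
  )[["elapsed"]]
})

# If elapsed time grows much faster than the fraction, that is the "knee".
print(data.frame(fraction = fractions, elapsed_sec = timings))
```

On the real data the same loop could be run over row subsets of the object in question, widening the fractions until the timings stop scaling roughly linearly.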
Thanks Jim, results below. ~2.66 gig for the object, so I guess there is no
way to speed up working with that entire data frame. What I've done in the
meantime is remove most of the columns, keeping only those I want to do data
manipulations with (approx 40 of the 264), and this is much, much quicker.
I will then join them back on at the end (am expecting that to take a
while!).

Any other feedback appreciated.

> object.size(morbidity)
2865834800 bytes

> str(morbidity)
'data.frame':   1775683 obs. of  264 variables:
 $ root            : chr "2G5PQVQH5KYZY" "DDSMVGQEW9YXP" "DDSMVGQEW9YXP" "DDSMVGQEW9YXP" ...
 $ lpnot           : chr "58GDA44MJSG3P" "4ZAM2XCK332NX" "5KX4FB6NTM831" "8CGXVV2A25C3M" ...
 $ hospital        : int 226 616 633 633 616 631 631 631 616 629 ...
 $ hosp_area       : int 2 1 1 1 1 1 1 1 1 1 ...
 $ hosp_region     : int NA NA NA NA NA NA NA NA NA NA ...
 $ hosp_type       : int 1 2 2 2 2 2 2 2 2 2 ...
 $ hosp_category   : int 3 2 2 2 2 2 2 2 2 2 ...
 $ hsa             : int NA NA NA NA NA NA NA NA NA NA ...
 $ adm_date        :Class 'Date' num [1:1775683] 11079 11084 11534 11869 11051 ...
 $ adm_date_ddmwdob: int 0 0 450 785 0 70 122 125 0 91 ...
 $ sep_date        :Class 'Date' num [1:1775683] 11089 11089 11534 11869 11057 ...
 $ sep_date_ddmwdob: int 10 5 450 785 6 70 122 136 23 98 ...
 $ adm_time        : int 2345 750 630 630 651 930 930 930 1146 1728 ...
 $ sep_time        : int 630 1715 1014 1013 1951 1630 1630 1000 941 1020 ...
 $ mf_los          : int 10 5 1 1 6 1 1 8 23 7 ...
 $ suburb          : chr "WARBURTON COMMUNITY" "WESTMINSTER" "WESTMINSTER" "WESTMINSTER" ...
 $ postcode        : int 6431 6061 6061 6061 6150 6160 6160 6160 6016 6016 ...
 $ state           : int 5 5 5 5 5 5 5 5 5 5 ...
 $ loc_code        : chr "E06001" "" "" "" ...
 $ lga             : int NA NA NA NA NA NA NA NA NA NA ...
 $ dob_my          : num 1.27e+12 1.27e+12 1.27e+12 1.27e+12 1.27e+12 ...
 $ dob_ddmwdob     : int 0 0 0 0 0 0 0 0 0 0 ...
 $ age             : int 0 0 1 2 0 0 0 0 0 0 ...
 $ age_group       : int 1 1 1 1 1 1 1 1 1 1 ...
 $ sex             : int 1 2 2 2 2 2 2 2 2 2 ...
 $ aborig          : int 1 4 4 4 4 4 4 4 4 4 ...
 $ cob             : int 1105 1105 1100 1101 1105 3 3 3 1105 1105 ...
 $ marital         : int 1 1 1 1 1 1 1 1 1 1 ...
 $ emp_stat        : int 1 1 8 1 1 1 1 1 1 1 ...
 $ interp          : int 2 2 2 2 2 2 2 2 2 2 ...
 $ occup           : int 96 96 NA NA 96 96 NA NA 96 NA ...
 $ src_ref         : int 0 0 NA NA 0 0 NA NA 0 NA ...
 $ pat_epi         : int 2 2 1 1 2 1 1 2 2 2 ...
 $ adm_from        : int 900 900 900 900 900 900 900 900 900 900 ...
 $ spl_adm         : int 25 39 50 50 39 84 25 25 39 25 ...
 $ spl_sep         : int 25 39 50 50 39 84 25 25 21 25 ...
 $ adm_type        : int 1 1 4 4 1 1 3 3 1 4 ...
 $ d_o_leav        : int 0 0 0 0 0 NA NA 3 0 NA ...
 $ psych_days      : int NA NA NA NA NA NA NA NA NA NA ...
 $ mh_legal        : int NA NA NA NA NA NA NA NA NA NA ...
 $ pay_clas        : int 9 9 3 3 9 3 3 3 3 3 ...
 $ vet_ent         : int NA NA NA NA NA NA NA NA NA NA ...
 $ ins_stat        : int 2 1 1 1 1 1 1 1 1 1 ...
 $ days_icu        : int 0 0 0 0 0 NA NA NA 20 NA ...
 $ hours_cmv       : int 0 0 NA NA 0 NA NA NA 0 NA ...
 $ readmis         : int NA NA NA NA NA NA NA NA NA NA ...
 $ ret_thea        : int NA NA NA NA NA NA NA NA NA NA ...
 $ epi_care        : int 6 6 21 21 6 1 21 21 6 21 ...
 $ pat_type        : int 2 2 6 6 2 6 6 6 1 6 ...
 $ cont_hos        : int NA NA NA NA NA NA NA NA NA NA ...
 $ sep_type        : int 9 9 9 9 9 9 9 9 9 9 ...
 $ sep_to          : int 900 900 900 900 900 900 900 900 900 900 ...
 $ language        : int NA NA NA NA NA NA NA NA NA NA ...
 $ src_refl        : int NA NA 1 1 NA NA 1 1 NA 1 ...
 $ src_refm        : int NA NA 2 2 NA NA 1 1 NA 2 ...
 $ src_reft        : int NA NA 1 1 NA NA 1 1 NA 1 ...
 $ accomod         : int NA NA 2 2 NA NA 2 2 NA 2 ...
 $ dqualnew        : int NA NA 0 0 NA NA NA NA NA 0 ...
 $ n_of_leav       : int NA NA 0 0 NA NA NA 1 NA NA ...
 $ prev_treat      : int NA NA NA NA NA NA NA NA NA NA ...
 $ sor             : int NA NA NA NA NA NA NA NA NA NA ...
 $ further_care    : int NA NA NA NA NA NA NA NA NA NA ...
 $ type_accomm     : int NA NA NA NA NA NA NA NA NA NA ...
 $ hith            : int NA NA NA 0 NA NA NA NA NA NA ...
 $ diag_imp_1      : chr "P" "P" "P" "P" ...
 $ diag_imp_2      : chr "" "" "" "" ...
 $ diag_imp_3      : chr "A" "" "" "" ...
 $ diag_imp_4      : chr "A" "" "" "" ...
 $ diag_imp_5      : chr "A" "" "" "" ...
 $ diag_imp_6      : chr "A" "" "" "" ...
 $ diag_imp_7      : chr "" "" "" "" ...
 $ diag_imp_8      : chr "" "" "" "" ...
 $ diag_imp_9      : chr "" "" "" "" ...
 $ diag_imp_10     : chr "" "" "" "" ...
 $ diag_imp_11     : chr "" "" "" "" ...
 $ diag_imp_12     : chr "" "" "" "" ...
 $ diag_imp_13     : chr "" "" "" "" ...
 $ diag_imp_14     : chr "" "" "" "" ...
 $ diag_imp_15     : chr "" "" "" "" ...
 $ diag_imp_16     : chr "" "" "" "" ...
 $ diag_imp_17     : chr "" "" "" "" ...
 $ diag_imp_18     : chr "" "" "" "" ...
 $ diag_imp_19     : chr "" "" "" "" ...
 $ diag_imp_20     : chr "" "" "" "" ...
 $ diag_imp_21     : chr "" "" "" "" ...
 $ diag_imp_22     : chr "" "" "" "" ...
 $ diag_seq_1      : int NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_2      : int NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_3      : int 4 NA NA NA NA 4 4 NA 4 4 ...
 $ diag_seq_4      : int 5 NA NA NA NA NA NA NA 5 NA ...
 $ diag_seq_5      : int 6 NA NA NA NA NA NA NA 6 NA ...
 $ diag_seq_6      : int 7 NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_7      : int NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_8      : int NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_9      : int NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_10     : int NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_11     : int NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_12     : int NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_13     : int NA NA NA NA NA NA NA NA NA NA ...
  [list output truncated]
 - attr(*, "datalabel")= chr ""
 - attr(*, "time.stamp")= chr ""
 - attr(*, "formats")= chr "%13s" "%13s" "%8.0g" "%8.0g" ...
 - attr(*, "types")= int 13 13 252 251 251 251 251 251 255 252 ...
 - attr(*, "val.labels")= chr "" "" "" "" ...
 - attr(*, "var.labels")= chr "" "" "" "hosp_area" ...
 - attr(*, "version")= int 10
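
To make the numbers and the workaround above concrete, here is a small sketch. The first part just divides the reported object size by 8 GB of RAM (taken as 8 * 1024^3 bytes, which is an assumption about how the machine reports memory); it comes out at roughly a third, so above the 25-30% rule of thumb. The second part shows the subset-then-rejoin idea on a tiny made-up data frame; `wide`, `work_cols`, `other1` and `other2` are hypothetical names, and a plain `cbind` only works here because the row order is never changed.

```r
# Reported size from object.size(); 8 GB is assumed to mean 8 * 1024^3 bytes.
obj_bytes <- 2865834800
ram_bytes <- 8 * 1024^3
share <- round(obj_bytes / ram_bytes, 2)   # share of RAM held by this one object
print(share)

# Hypothetical small stand-in for the wide data frame.
wide <- data.frame(id = 1:5,
                   adm_ms = c(0, 86400000, 172800000, 259200000, 345600000),
                   other1 = letters[1:5],
                   other2 = 5:1,
                   stringsAsFactors = FALSE)

work_cols <- c("id", "adm_ms")                       # columns to manipulate
work <- wide[, work_cols, drop = FALSE]              # narrow working copy
rest <- wide[, setdiff(names(wide), work_cols), drop = FALSE]

# Do the manipulation on the narrow copy only.
work$adm_date <- as.Date(work$adm_ms / (1000 * 60 * 60 * 24),
                         origin = "1960-01-01")

# Row order is unchanged, so a column bind restores the full table.
rejoined <- cbind(work, rest)
str(rejoined)
```

If rows get reordered or dropped along the way, a keyed `merge` on an id column would be the safer way to put the pieces back together, at the cost of more time and memory.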