Re: [R] Adding column to dataframe

Matt Cooper Mon, 23 Aug 2010 20:29:01 -0700

>
> Provide a 'str' and 'object.size' of the object
> so that we can see what you are working with.  My rule of thumb is
> that no single object should take more than 25-30% of memory since
> copies may be made.  So the reason things are taking 20 minutes may be
> that you are paging.  It is always good to break the problem into
> pieces to see what is happening.  Read in only 25% of the data and
> time it; then 50% and so on.  In any performance-related problem you
> need to determine where the "knee of the curve" is.  Never undertake
> processing the large data file at once; start with some pieces and
> work up so that you know what to expect.
>
> On Wed, Aug 18, 2010 at 9:46 PM, Matt Cooper <mattcst...@gmail.com> wrote:
>  > 2) My specific problem with this dataset.
> >
> > I am essentially trying to convert a date and add it to a data frame. I
> > imagine any 'data manipulation on a column within dataframe into a new
> > column' will present the same issue, be it as.Date or anything else.
> >
> > I have a dataset, size
> >
> >> dim(morbidity)
> > [1] 1775683     264
> >
> > This was read in from a STATA .dta file. The dates have come in as the
> > number of ms from 1960 so I have the following to convert these to usable
> > dates.
> >
> > as.Date(morbidity$adm_date / (100*10*60*60*24), origin="1960-01-01")
> >
> > when I store this as a vector it is near instant, <5 seconds
> > test <- as.Date(etc)
> > when I place it over itself it takes ~20 minutes
> > morbidity$adm_date <- as.Date(etc)
> > when I place the vector over it (so no computation involved), or place it as
> > a new column it still takes ~20 minutes
> > morbidity$adm_date <- test
> > morbidity$new_col <- test
> > when I tried a cbind to add it that way it took >20 minutes
> > new_morb <- cbind(morbidity,test)
> >
> > Has anyone done something similar or know of a different command that should
> > work faster? I can't get my head around what R is doing, if it can create
> > the vector instantly then the computation is quite simple, I don't
> > understand why then adding it as a column to a dataframe can take that long.
> >
> > R64 bit on mac os x, 2.4 GHz dual core, 8gb ram so more than enough
> > resources.
>


 Thanks Jim, results below.

The object is ~2.67 GiB, so I guess there is no way to speed up working with
that entire data frame. In the meantime I have cut it down to just the
columns I want to do data manipulations on (approx 40 of the 264), and this
is much, much quicker. I will join the remaining columns back on at the end
(I am expecting that to take a while!).
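
For anyone following along, here is a rough sketch of that workaround on a
toy data frame. The name morbidity_toy and its three columns are made up for
illustration; only the ms-since-1960 conversion comes from my original post.
It assumes the rows are never reordered in between, so a plain positional
cbind is enough to put the pieces back together.

```r
# Toy stand-in for the real data (hypothetical names and values).
# adm_date holds milliseconds since 1960-01-01, as in the Stata import.
ms_per_day <- 1000 * 60 * 60 * 24
morbidity_toy <- data.frame(
  root     = c("A", "B", "C"),
  adm_date = c(0, 1, 366) * ms_per_day,
  hospital = c(226L, 616L, 633L),
  stringsAsFactors = FALSE
)

# 1) Keep only the columns that need manipulating.
work_cols <- c("adm_date")
work <- morbidity_toy[, work_cols, drop = FALSE]

# 2) Do the conversion on the small frame.
work$adm_date <- as.Date(work$adm_date / ms_per_day, origin = "1960-01-01")

# 3) Join back. Rows were never reordered, so a positional cbind is enough.
rest   <- morbidity_toy[, setdiff(names(morbidity_toy), work_cols), drop = FALSE]
result <- cbind(rest, work)

print(result)
```

If the rows do get reordered or filtered along the way, a positional cbind
would silently misalign them, and a merge on a key column would be the safer
choice.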

Any other feedback appreciated.

> object.size(morbidity)
2865834800 bytes
> str(morbidity)
'data.frame': 1775683 obs. of  264 variables:
 $ root            : chr  "2G5PQVQH5KYZY" "DDSMVGQEW9YXP" "DDSMVGQEW9YXP" "DDSMVGQEW9YXP" ...
 $ lpnot           : chr  "58GDA44MJSG3P" "4ZAM2XCK332NX" "5KX4FB6NTM831" "8CGXVV2A25C3M" ...
 $ hospital        : int  226 616 633 633 616 631 631 631 616 629 ...
 $ hosp_area       : int  2 1 1 1 1 1 1 1 1 1 ...
 $ hosp_region     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hosp_type       : int  1 2 2 2 2 2 2 2 2 2 ...
 $ hosp_category   : int  3 2 2 2 2 2 2 2 2 2 ...
 $ hsa             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ adm_date        :Class 'Date'  num [1:1775683] 11079 11084 11534 11869 11051 ...
 $ adm_date_ddmwdob: int  0 0 450 785 0 70 122 125 0 91 ...
 $ sep_date        :Class 'Date'  num [1:1775683] 11089 11089 11534 11869 11057 ...
 $ sep_date_ddmwdob: int  10 5 450 785 6 70 122 136 23 98 ...
 $ adm_time        : int  2345 750 630 630 651 930 930 930 1146 1728 ...
 $ sep_time        : int  630 1715 1014 1013 1951 1630 1630 1000 941 1020 ...
 $ mf_los          : int  10 5 1 1 6 1 1 8 23 7 ...
 $ suburb          : chr  "WARBURTON COMMUNITY" "WESTMINSTER" "WESTMINSTER" "WESTMINSTER" ...
 $ postcode        : int  6431 6061 6061 6061 6150 6160 6160 6160 6016 6016 ...
 $ state           : int  5 5 5 5 5 5 5 5 5 5 ...
 $ loc_code        : chr  "E06001" "" "" "" ...
 $ lga             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ dob_my          : num  1.27e+12 1.27e+12 1.27e+12 1.27e+12 1.27e+12 ...
 $ dob_ddmwdob     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age             : int  0 0 1 2 0 0 0 0 0 0 ...
 $ age_group       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sex             : int  1 2 2 2 2 2 2 2 2 2 ...
 $ aborig          : int  1 4 4 4 4 4 4 4 4 4 ...
 $ cob             : int  1105 1105 1100 1101 1105 3 3 3 1105 1105 ...
 $ marital         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ emp_stat        : int  1 1 8 1 1 1 1 1 1 1 ...
 $ interp          : int  2 2 2 2 2 2 2 2 2 2 ...
 $ occup           : int  96 96 NA NA 96 96 NA NA 96 NA ...
 $ src_ref         : int  0 0 NA NA 0 0 NA NA 0 NA ...
 $ pat_epi         : int  2 2 1 1 2 1 1 2 2 2 ...
 $ adm_from        : int  900 900 900 900 900 900 900 900 900 900 ...
 $ spl_adm         : int  25 39 50 50 39 84 25 25 39 25 ...
 $ spl_sep         : int  25 39 50 50 39 84 25 25 21 25 ...
 $ adm_type        : int  1 1 4 4 1 1 3 3 1 4 ...
 $ d_o_leav        : int  0 0 0 0 0 NA NA 3 0 NA ...
 $ psych_days      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ mh_legal        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ pay_clas        : int  9 9 3 3 9 3 3 3 3 3 ...
 $ vet_ent         : int  NA NA NA NA NA NA NA NA NA NA ...
 $ ins_stat        : int  2 1 1 1 1 1 1 1 1 1 ...
 $ days_icu        : int  0 0 0 0 0 NA NA NA 20 NA ...
 $ hours_cmv       : int  0 0 NA NA 0 NA NA NA 0 NA ...
 $ readmis         : int  NA NA NA NA NA NA NA NA NA NA ...
 $ ret_thea        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ epi_care        : int  6 6 21 21 6 1 21 21 6 21 ...
 $ pat_type        : int  2 2 6 6 2 6 6 6 1 6 ...
 $ cont_hos        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ sep_type        : int  9 9 9 9 9 9 9 9 9 9 ...
 $ sep_to          : int  900 900 900 900 900 900 900 900 900 900 ...
 $ language        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ src_refl        : int  NA NA 1 1 NA NA 1 1 NA 1 ...
 $ src_refm        : int  NA NA 2 2 NA NA 1 1 NA 2 ...
 $ src_reft        : int  NA NA 1 1 NA NA 1 1 NA 1 ...
 $ accomod         : int  NA NA 2 2 NA NA 2 2 NA 2 ...
 $ dqualnew        : int  NA NA 0 0 NA NA NA NA NA 0 ...
 $ n_of_leav       : int  NA NA 0 0 NA NA NA 1 NA NA ...
 $ prev_treat      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ sor             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ further_care    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ type_accomm     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hith            : int  NA NA NA 0 NA NA NA NA NA NA ...
 $ diag_imp_1      : chr  "P" "P" "P" "P" ...
 $ diag_imp_2      : chr  "" "" "" "" ...
 $ diag_imp_3      : chr  "A" "" "" "" ...
 $ diag_imp_4      : chr  "A" "" "" "" ...
 $ diag_imp_5      : chr  "A" "" "" "" ...
 $ diag_imp_6      : chr  "A" "" "" "" ...
 $ diag_imp_7      : chr  "" "" "" "" ...
 $ diag_imp_8      : chr  "" "" "" "" ...
 $ diag_imp_9      : chr  "" "" "" "" ...
 $ diag_imp_10     : chr  "" "" "" "" ...
 $ diag_imp_11     : chr  "" "" "" "" ...
 $ diag_imp_12     : chr  "" "" "" "" ...
 $ diag_imp_13     : chr  "" "" "" "" ...
 $ diag_imp_14     : chr  "" "" "" "" ...
 $ diag_imp_15     : chr  "" "" "" "" ...
 $ diag_imp_16     : chr  "" "" "" "" ...
 $ diag_imp_17     : chr  "" "" "" "" ...
 $ diag_imp_18     : chr  "" "" "" "" ...
 $ diag_imp_19     : chr  "" "" "" "" ...
 $ diag_imp_20     : chr  "" "" "" "" ...
 $ diag_imp_21     : chr  "" "" "" "" ...
 $ diag_imp_22     : chr  "" "" "" "" ...
 $ diag_seq_1      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_2      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_3      : int  4 NA NA NA NA 4 4 NA 4 4 ...
 $ diag_seq_4      : int  5 NA NA NA NA NA NA NA 5 NA ...
 $ diag_seq_5      : int  6 NA NA NA NA NA NA NA 6 NA ...
 $ diag_seq_6      : int  7 NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_7      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_8      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_9      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_10     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_11     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_12     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ diag_seq_13     : int  NA NA NA NA NA NA NA NA NA NA ...
  [list output truncated]
 - attr(*, "datalabel")= chr ""
 - attr(*, "time.stamp")= chr ""
 - attr(*, "formats")= chr  "%13s" "%13s" "%8.0g" "%8.0g" ...
 - attr(*, "types")= int  13 13 252 251 251 251 251 251 255 252 ...
 - attr(*, "val.labels")= chr  "" "" "" "" ...
 - attr(*, "var.labels")= chr  "" "" "" "hosp_area" ...
 - attr(*, "version")= int 10
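
Jim's suggestion of timing the work on growing pieces could be tried along
these lines. This is only a sketch on a made-up data frame (toy, with
hypothetical id and ms columns), not something I have run on the real
morbidity data; the fractions and the assignment being timed would need to
be swapped for the real ones.

```r
# Hypothetical stand-in data; replace with the real data frame.
ms_per_day <- 1000 * 60 * 60 * 24
n   <- 20000
toy <- data.frame(id = seq_len(n), ms = seq_len(n) * ms_per_day)

# Time the same column assignment on 25%, 50%, 75% and 100% of the rows.
fractions <- c(0.25, 0.5, 0.75, 1)
elapsed <- sapply(fractions, function(f) {
  piece <- toy[seq_len(floor(f * n)), , drop = FALSE]
  system.time(
    piece$adm_date <- as.Date(piece$ms / ms_per_day, origin = "1960-01-01")
  )[["elapsed"]]
})

timings <- data.frame(
  fraction = fractions,
  rows     = floor(fractions * n),
  elapsed  = elapsed
)
print(timings)

# Object size in readable units, to compare against available RAM.
print(object.size(toy), units = "Mb")
```

If the elapsed time grows roughly in proportion to the row count and then
jumps sharply at some fraction, that jump would be the "knee of the curve"
Jim describes, and a hint that the machine has started paging.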

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
