Re: [R] Adding column to dataframe

2010-08-25 Thread nzcoops

An update on this: I ran the same command on a grid of computers with 32 GB
of RAM, and it completed in 15 seconds, compared to ~20 minutes on my
desktop.

So it was simply a RAM issue, as a few people on the list suspected.

Thanks
-- 
View this message in context: 
http://r.789695.n4.nabble.com/Adding-column-to-dataframe-tp2330556p2339076.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Adding column to dataframe

2010-08-23 Thread Matt Cooper

 Thanks Jim, results below.

The object is ~2.66 GB, so I guess there is no way to speed up working with
the entire data frame. What I've done in the meantime is cut the columns down
to the ones I want to do data manipulations with (approx 40 of the 264), and
this is much, much quicker. I will then join the rest back on at the end (I am
expecting that to take a while!).

Any other feedback appreciated.
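
The workaround described above (work on a narrow copy, then join the remaining columns back on) might look roughly like the sketch below. The tiny data frame, the `id` column, and the names `work_cols`, `work`, `rest`, and `combined` are made up for illustration; only `adm_date`, `hospital`, and `sex` come from the real `str` output. It assumes the row order is not changed in between, so a plain `cbind` is enough to rejoin.

```r
# Hypothetical miniature of the wide data frame (the real one has 264 columns).
morbidity <- data.frame(
  id       = 1:3,
  adm_date = c(0, 86400000, 172800000),  # ms since 1960-01-01
  hospital = c(226L, 616L, 633L),
  sex      = c(1L, 2L, 2L)
)

# Columns needed for the manipulations; everything else is set aside.
work_cols <- c("id", "adm_date")
work <- morbidity[, work_cols, drop = FALSE]
rest <- morbidity[, setdiff(names(morbidity), work_cols), drop = FALSE]

# Manipulate the narrow frame only.
ms_per_day <- 1000 * 60 * 60 * 24
work$adm_date <- as.Date(work$adm_date / ms_per_day, origin = "1960-01-01")

# Rejoin by position (row order is untouched) and restore the column order.
combined <- cbind(work, rest)[, names(morbidity)]
print(combined)
```

Whether the final `cbind` is cheap on the full data depends on available memory, since it still builds a complete copy of the wide frame.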

> object.size(morbidity)
2865834800 bytes
> str(morbidity)
'data.frame': 1775683 obs. of  264 variables:
 $ root            : chr  2G5PQVQH5KYZY DDSMVGQEW9YXP DDSMVGQEW9YXP DDSMVGQEW9YXP ...
 $ lpnot           : chr  58GDA44MJSG3P 4ZAM2XCK332NX 5KX4FB6NTM831 8CGXVV2A25C3M ...
 $ hospital        : int  226 616 633 633 616 631 631 631 616 629 ...
 $ hosp_area       : int  2 1 1 1 1 1 1 1 1 1 ...
 $ hosp_region     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hosp_type       : int  1 2 2 2 2 2 2 2 2 2 ...
 $ hosp_category   : int  3 2 2 2 2 2 2 2 2 2 ...
 $ hsa             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ adm_date        :Class 'Date'  num [1:1775683] 11079 11084 11534 11869 11051 ...
 $ adm_date_ddmwdob: int  0 0 450 785 0 70 122 125 0 91 ...
 $ sep_date        :Class 'Date'  num [1:1775683] 11089 11089 11534 11869 11057 ...
 $ sep_date_ddmwdob: int  10 5 450 785 6 70 122 136 23 98 ...
 $ adm_time        : int  2345 750 630 630 651 930 930 930 1146 1728 ...
 $ sep_time        : int  630 1715 1014 1013 1951 1630 1630 1000 941 1020 ...
 $ mf_los          : int  10 5 1 1 6 1 1 8 23 7 ...
 $ suburb          : chr  WARBURTON COMMUNITY WESTMINSTER WESTMINSTER WESTMINSTER ...
 $ postcode        : int  6431 6061 6061 6061 6150 6160 6160 6160 6016 6016 ...
 $ state           : int  5 5 5 5 5 5 5 5 5 5 ...
 $ loc_code        : chr  E06001...
 $ lga             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ dob_my          : num  1.27e+12 1.27e+12 1.27e+12 1.27e+12 1.27e+12 ...
 $ dob_ddmwdob     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age             : int  0 0 1 2 0 0 0 0 0 0 ...
 $ age_group       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sex             : int  1 2 2 2 2 2 2 2 2 2 ...
 $ aborig          : int  1 4 4 4 4 4 4 4 4 4 ...
 $ cob             : int  1105 1105 1100 1101 1105 3 3 3 1105 1105 ...
 $ marital         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ emp_stat        : int  1 1 8 1 1 1 1 1 1 1 ...
 $ interp          : int  2 2 2 2 2 2 2 2 2 2 ...
 $ occup           : int  96 96 NA NA 96 96 NA NA 96 NA ...
 $ src_ref         : int  0 0 NA NA 0 0 NA NA 0 NA ...
 $ pat_epi         : int  2 2 1 1 2 1 1 2 2 2 ...
 $ adm_from        : int  900 900 900 900 900 900 900 900 900 900 ...
 $ spl_adm         : int  25 39 50 50 39 84 25 25 39 25 ...
 $ spl_sep         : int  25 39 50 50 39 84 25 25 21 25 ...
 $ adm_type        : int  1 1 4 4 1 1 3 3 1 4 ...
 $ d_o_leav        : int  0 0 0 0 0 NA NA 3 0 NA ...
 $ psych_days      : int  NA NA NA NA NA

Re: [R] Adding column to dataframe

2010-08-19 Thread jim holtman
I think you are probably paging on your system.  Turn on your
performance metrics and look at it.  If the object you are processing
is all numeric, it would seem to require about 3.5GB of space (50% of
available memory).  Provide the 'str' and 'object.size' of the object
so that we can see what you are working with.  My rule of thumb is
that no single object should take more than 25-30% of memory, since
copies may be made.  So the reason things are taking 20 minutes may be
that you are paging.  It is always good to break the problem into
pieces to see what is happening.  Read in only 25% of the data and
time it; then 50% and so on.  In any performance-related problem you
need to determine where the knee of the curve is.  Never undertake
processing a large data file all at once; start with some pieces and
work up so that you know what to expect.
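
The "time it on 25%, then 50%, and so on" advice could be sketched as follows. This is only an illustration on a small synthetic all-numeric data frame; `df`, `fractions`, `timings`, and `result` are hypothetical names, and the sizes are far too small to show paging. The aim is just to show the shape of the experiment: if elapsed time jumps sharply between two fractions rather than growing steadily, that is the knee of the curve.

```r
# Hypothetical stand-in for the real data: a small all-numeric data frame.
set.seed(1)
n_rows <- 20000
n_cols <- 20
df <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows))

# Size of the whole object, to compare against installed RAM.
print(object.size(df), units = "Mb")

# Time the same column assignment on 25%, 50%, 75% and 100% of the rows.
fractions <- c(0.25, 0.50, 0.75, 1.00)
timings <- sapply(fractions, function(f) {
  part <- df[seq_len(floor(f * nrow(df))), ]
  new_col <- part[[1]] * 2
  system.time(part$new_col <- new_col)[["elapsed"]]
})

result <- data.frame(fraction = fractions, elapsed_sec = timings)
print(result)
```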

On Wed, Aug 18, 2010 at 9:46 PM, Matt Cooper mattcst...@gmail.com wrote:
 Two questions:
 1) Are there any good R guides/sites with information/techniques for dealing
 with large datasets in R? (Large being ~2 mil rows and ~200 columns)

 2) My specific problem with this dataset.

 I am essentially trying to convert a date and add it to a data frame. I
 imagine any 'data manipulation on a column within dataframe into a new
 column' will present the same issue, be it as.Date or anything else.

 I have a dataset, size

 dim(morbidity)
 [1] 1775683     264

 This was read in from a STATA .dta file. The dates have come in as the
 number of ms from 1960 so I have the following to convert these to usable
 dates.

 as.Date(morbidity$adm_date / (100*10*60*60*24), origin="1960-01-01")

 when I store this as a vector it is near instant, 5 seconds
 test <- as.Date(etc)
 when I place it over itself it takes ~20 minutes
 morbidity$adm_date <- as.Date(etc)
 when I place the vector over it (so no computation involved), or place it as
 a new column it still takes ~20 minutes
 morbidity$adm_date <- test
 morbidity$new_col <- test
 when I tried a cbind to add it that way it took 20 minutes
 new_morb <- cbind(morbidity,test)

 Has anyone done something similar, or know of a different command that should
 work faster? I can't get my head around what R is doing: if it can create
 the vector almost instantly, then the computation is quite simple, so I don't
 understand why adding it as a column to a data frame takes that long.

 64-bit R on Mac OS X, 2.4 GHz dual core, 8 GB RAM, so more than enough
 resources.
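
 For reference, the conversion described above can be tried on a few made-up values. This is a minimal sketch: the three sample timestamps and the name `ms_per_day` are assumptions for illustration, and the tiny data frame only borrows the names `morbidity` and `adm_date` from the post. It shows the two steps being compared, building the vector and then assigning it as a column.

```r
# Hypothetical sample values: milliseconds since 1960-01-01.
ms_per_day <- 1000 * 60 * 60 * 24
morbidity <- data.frame(adm_date = c(0, 1 * ms_per_day, 366 * ms_per_day))

# Step 1: build the converted vector on its own.
test <- as.Date(morbidity$adm_date / ms_per_day, origin = "1960-01-01")

# Step 2: assign it over the original column.
morbidity$adm_date <- test

print(morbidity$adm_date)
```

 Note that `origin` has to be a quoted string (or a Date); written bare, `1960-01-01` would be evaluated as the arithmetic expression 1960 - 1 - 1.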

 Thanks
 Matt

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
