On Fri, Mar 18, 2016 at 8:28 PM, Roy Mendelssohn - NOAA Federal
<roy.mendelss...@noaa.gov> wrote:
> Hi Henrik:
>
> I want to do what in oceanography is called an EOF analysis, which is just a
> PCA. Unless I am missing something, in R I need to flatten my 3-D array
> into a 2-D data matrix. I can fit the entire 30GB matrix into memory, and I
> believe I have enough memory to do the PCA by constraining the number of
> components returned.  What I don’t think I have enough memory for is an
> operation that makes a copy of the matrix.
>
> As I said, in theory I know how to do the flattening; it's a simple command,
> but in practice I don’t have enough memory.  So I spent the afternoon
> rewriting my code to read in parts of the data at a time and then put
> those in the appropriate places of an already-flattened matrix of the
> appropriate size.  In case someone is wondering, on the 3D grid the array is
> [1001,1001,3650].  So I create an empty matrix of size [1001*1001,3650],
> read in a slice of the lat-lon grid, and map it into the appropriate
> places in the flattened matrix.  By reading in appropriately sized chunks, my
> memory usage is not pushed too far.

Sounds good.  There's another small caveat: make sure to specify the
'data' argument for matrix() when allocating an "empty" matrix, e.g.

    X <- matrix(NA_real_, nrow=1001*1001, ncol=3650)

This will give you a double matrix with all values missing.  If you use
the default

    X <- matrix(nrow=1001*1001, ncol=3650)

you'll get a logical matrix (4 bytes per element), which will be coerced,
and therefore copied, to a double matrix (8 bytes per element) as soon as
you assign a double value (e.g. X[1,1] <- 3.14).  That copy is a
complete waste of memory and time.  See
http://www.jottr.org/2014/06/matrixNA-wrong-way.html for details.
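
As a rough sketch of the chunked fill you describe, something along these
lines should work.  The dimensions are toy values, 'full' stands in for the
data on disk, and read_chunk() is a hypothetical placeholder for whatever
reads a block of time steps from your netCDF files; none of these names come
from your code.

```r
# Toy dimensions standing in for the real 1001 x 1001 x 3650 grid.
noLat <- 4L; noLon <- 3L; noTime <- 10L

# Stand-in for the data on disk (assumption: the real data come from netCDF).
full <- array(rnorm(noLat * noLon * noTime), dim = c(noLat, noLon, noTime))

# Hypothetical reader: returns the lat x lon x time block for the given steps.
read_chunk <- function(t_idx) full[, , t_idx, drop = FALSE]

# Preallocate as double (NA_real_), not as the default logical NA.
X <- matrix(NA_real_, nrow = noLat * noLon, ncol = noTime)

chunk_size <- 3L
for (start in seq(1L, noTime, by = chunk_size)) {
  t_idx <- start:min(start + chunk_size - 1L, noTime)
  chunk <- read_chunk(t_idx)
  # R arrays are column-major, so each time step flattens to one column of X.
  X[, t_idx] <- matrix(chunk, nrow = noLat * noLon, ncol = length(t_idx))
}

typeof(X)                            # storage type of the preallocated matrix
typeof(matrix(nrow = 2, ncol = 2))   # storage type of the default allocation
```

Only the small per-chunk matrix is reshaped here; the large X is allocated
once and filled column block by column block.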

/Henrik

>
> -Roy
>
>
>> On Mar 18, 2016, at 7:37 PM, Henrik Bengtsson <henrik.bengts...@gmail.com> 
>> wrote:
>>
>> On Fri, Mar 18, 2016 at 3:15 PM, Roy Mendelssohn - NOAA Federal
>> <roy.mendelss...@noaa.gov> wrote:
>>> Thanks.  That is what I needed to know.  I don’t want to play around with 
>>> some of the other suggestions, as I don’t totally understand what they do, 
>>> and don’t want to risk messing up something and not be aware of it.
>>>
>>> There is a way to read in the data a chunk at a time, reshape it, and put
>>> it into the (reshaped) larger array.  It is harder to program but probably
>>> worth the pain to be certain of what I am doing.
>>
>> I recommend this approach; whenever I work with reasonably large data
>> (that may become even larger in the future), I try to implement a
>> constant-memory version of the algorithm, which often comes down to
>> processing data in chunks.  The simplest version of this is to read
>> all data into memory and then subset, but if you can read the data in
>> chunks, that is even better.
>>
>> Though, I'm curious as to what matrix operations you wish to perform.
>> If you only need regular summation, base::.rowSums() and
>> base::.colSums() allow you to override the default dimensions on the
>> fly without inducing extra copies, e.g.
>>
>> > X <- array(1:24, dim=c(2,3,4))
>> > .rowSums(X, m=6, n=4)
>> [1] 40 44 48 52 56 60
>> > rowSums(matrix(X, nrow=6, ncol=4))
>> [1] 40 44 48 52 56 60
>>
>> For other types of calculations, you might want to look at
>> matrixStats.  It has partial(*) support for overriding the default
>> dimensions in a similar fashion.  For instance,
>>
>> > library("matrixStats")
>> > rowVars(X, dim.=c(6,4))
>>
>> The above effectively calculates rowVars(matrix(X, nrow=6, ncol=4))
>> without making copies.
>>
>> (*) By partial I mean that this feature hasn't been pushed
>> through to all of the matrixStats functions yet, cf.
>> https://github.com/HenrikBengtsson/matrixStats/issues/83.
>>
>> Cheers,
>>
>> Henrik
>> (author of matrixStats)
>>
>>>
>>> I had a feeling a copy was made, just wanted to make certain of it.
>>>
>>> Thanks again,
>>>
>>> -Roy
>>>
>>>> On Mar 18, 2016, at 2:56 PM, Dénes Tóth <toth.de...@ttk.mta.hu> wrote:
>>>>
>>>> Hi Roy,
>>>>
>>>> R (usually) makes a copy if the dimensionality of an array is modified, 
>>>> even if you use this syntax:
>>>> x <- array(1:24, c(2, 3, 4))
>>>> dim(x) <- c(6, 4)
>>>>
>>>> See also ?tracemem, ?data.table::address, ?pryr::address and other tools 
>>>> to trace if an internal copy is done.
>>>>
>>>> Workaround: use data.table::setattr or bit::setattr to modify the 
>>>> dimensions in place (i.e., without making a copy). Risk: if you modify an 
>>>> object by reference, all other objects which point to the same memory 
>>>> address will be modified silently, too.
>>>>
>>>> HTH,
>>>> Denes
>>>>
>>>>
>>>>
>>>> On 03/18/2016 10:28 PM, Roy Mendelssohn - NOAA Federal wrote:
>>>>> Hi All:
>>>>>
>>>>> I am working with a very large array.  If noLat is the number of
>>>>> latitudes, noLon the number of longitudes and noTime the number of time
>>>>> periods, the array is of the form:
>>>>>
>>>>> myData[noLat, noLon, noTime].
>>>>>
>>>>> It is read in this way because that is how it is stored in a series of
>>>>> netCDF files.  For the analysis I need to do, I need instead the array:
>>>>>
>>>>> myData[noLat*noLon, noTime].  Normally this would be easy:
>>>>>
>>>>> myData <- array(myData, dim=c(noLat*noLon, noTime))
>>>>>
>>>>> My question is how this command works in R: does it make a copy of
>>>>> the existing array, with different indices for the dimensions, or does it
>>>>> just redo the indices and leave the given array as is?  The reason for
>>>>> this question is that my array is 30GB in memory, and I don’t have enough
>>>>> space to have a copy of the array in memory.  If the former, I will have
>>>>> to figure out a workaround to bring in only part of the data at a time
>>>>> and put it into the proper locations.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Roy
>>>>>
>>>>>
>>>>>
>>>>> **********************
>>>>> "The contents of this message do not reflect any position of the U.S. 
>>>>> Government or NOAA."
>>>>> **********************
>>>>> Roy Mendelssohn
>>>>> Supervisory Operations Research Analyst
>>>>> NOAA/NMFS
>>>>> Environmental Research Division
>>>>> Southwest Fisheries Science Center
>>>>> ***Note new address and phone***
>>>>> 110 Shaffer Road
>>>>> Santa Cruz, CA 95060
>>>>> Phone: (831)-420-3666
>>>>> Fax: (831) 420-3980
>>>>> e-mail: roy.mendelss...@noaa.gov www: http://www.pfeg.noaa.gov/
>>>>>
>>>>> "Old age and treachery will overcome youth and skill."
>>>>> "From those who have been given much, much will be expected"
>>>>> "the arc of the moral universe is long, but it bends toward justice" -MLK 
>>>>> Jr.
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide 
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>
>
>

