On Friday, August 14, 2015, Charles Novaes de Santana <[email protected]> wrote:
> 1) to use only the subset of suitable habitats to build the matrix of
> distances (and then to use sparse matrix as suggested by Stefan)

Distance matrices are not usually sparse, since the farthest-apart pairs of points have the largest distances and are both the most common and the least interesting. However, you could store only the distances between close points in a sparse matrix and use zero to represent the distance between pairs of points that are not close enough to be of interest. Alternatively, you could store 1/d instead of d: closer points then have higher weights, and you can threshold 1/distance so that far-apart points have zero entries (see the sketch below).

> 2) to use a machine with more memory and try to run my models using the
> matrices with all the sites

This is probably the easiest thing to do, since your data set is not of a truly unreasonable size, just largish. However, you may be much happier if you can make your problem smaller than O(n^2).

> 3) to try another language/library that might work better with such big
> amount of data (like python, or R).

This problem isn't going to be fundamentally different no matter what language you use: you have more data than fits in memory. Spilling memory to disk is going to be *much* slower than just recomputing distances, by orders of magnitude. As John suggested, is there any particular reason you need to materialize all of these values in a matrix? What computation are you going to perform over that matrix?
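
Here is a minimal sketch of the thresholded 1/d idea in Julia. The coordinate layout, the cutoff value, and the random data are all illustrative assumptions, not anything from the original message:

    using SparseArrays

    # Illustrative data: one (x, y) column per site; sizes are made up.
    coords = rand(2, 1000)
    n = size(coords, 2)
    cutoff = 0.1            # assumed distance threshold for "close enough"

    # Keep only pairs within the cutoff, storing 1/d so that closer sites
    # get larger weights; farther pairs remain implicit structural zeros.
    rows, cols, vals = Int[], Int[], Float64[]
    for j in 1:n, i in 1:(j - 1)
        dx = coords[1, i] - coords[1, j]
        dy = coords[2, i] - coords[2, j]
        d = sqrt(dx^2 + dy^2)
        if 0 < d <= cutoff
            push!(rows, i); push!(cols, j); push!(vals, 1 / d)
            push!(rows, j); push!(cols, i); push!(vals, 1 / d)  # symmetric
        end
    end
    W = sparse(rows, cols, vals, n, n)

With this layout you store only the within-cutoff pairs rather than all n^2 entries, so the memory cost tracks the number of genuinely interesting neighbor pairs.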

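And a sketch of the recompute-on-demand alternative mentioned above, assuming the downstream computation is some per-site reduction. The function name and the particular reduction (mean distance to within-cutoff neighbors) are hypothetical stand-ins for whatever the full matrix was meant to feed:

    # Recompute distances when needed instead of spilling an n-by-n
    # matrix to disk: each call touches O(n) values and O(1) extra memory.
    function mean_neighbor_distance(coords::AbstractMatrix, i::Integer, cutoff::Real)
        total, count = 0.0, 0
        for j in 1:size(coords, 2)
            j == i && continue
            dx = coords[1, i] - coords[1, j]
            dy = coords[2, i] - coords[2, j]
            d = sqrt(dx^2 + dy^2)
            if d <= cutoff
                total += d
                count += 1
            end
        end
        return count == 0 ? 0.0 : total / count
    end

Since recomputing a distance is a handful of arithmetic operations while a disk read is orders of magnitude slower, this kind of on-the-fly recomputation is usually the better trade when the matrix does not fit in memory.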