On Tue, 16 Nov 2010, John GIBSON wrote:
Dear List
I am keen to receive advice from listers on the best strategies in R for
forming and using a spatial weights matrix for a large (ca. 150,000)
dataset. Is it fastest and most memory-efficient to do the entire
analysis in R, or is it better to use GeoDa to form a gal or gwt file
and then import that into R and proceed with commands from spdep?
The background is that I am using R 2.12.0 (64-bit) on a 16-Gb dual-core
Windows XP setup to run spatial diagnostics (Moran and LM tests) on OLS
residuals and estimate spatial error and spatial lag models on an
environmental dataset, with 150,000 observations (1 km pixels). To start
with more manageable samples I have grid-sampled pixels at the
intersection of every 4th, 8th and 16th row and column, giving samples
of ca. 9000, 2300 and 600. To ensure no "islands" when using the
smallest samples, the neighbourhood for the weights matrix is set as 34
km, and the weights are inverse-distance. In case anyone asks "why are
you even bothering to do this" it is to examine the impact of the
spatial sampling, since we have previously based our analysis (of
impacts of roads on forest cover) on a 1-in-25 sample and now want to
see how sensitive inferences might be to this sort of spatial sampling.
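For reference, the sampling was done along these lines (a sketch only; the
integer grid-index columns "row" and "col", and the full data frame
"rawd_full", are hypothetical names for the full 1 km pixel grid):
# keep pixels at the intersection of every k-th row and column
# (k = 4, 8, 16 gives the samples of ca. 9000, 2300 and 600)
k <- 4
rawd <- rawd_full[rawd_full$row %% k == 0 & rawd_full$col %% k == 0, ]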
Back to my current problem: while R breezes through the small samples
(less than a minute for a sample of 600, about 20 minutes for a sample
of 2300) it has been running for 3 days now on a sample of 9000
observations, using about 50% of the CPU and about 5.6 Gb of the
16 Gb available on the machine. I have access to a 128-CPU cluster computer
to run the analysis for the full population of 150,000
observations, but based on the slow performance with 9000 on a 16-Gb
machine I am not optimistic about the ability of the cluster computer to
run the data at full 1km scale.
If you ran the steps separately, you would see where the bottleneck is. My
guess is that method="eigen" in the first spatial regression is the
underlying problem, and that replacing this with sparse matrix techniques
will resolve the problem: method="Matrix". See also comments inline in
code below, as you seem to create dense matrices when sparse
representations are all you need.
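For example, wrapping each stage in system.time() (a sketch only, reusing
the object names from your code below, with the right-hand side of the
formula abbreviated) would show which step is eating the time:
library(spdep)
# time each stage separately to locate the bottleneck
system.time(W_34km_nb <- dnearneigh(coords, 0, 34000))
system.time(lw <- nb2listw(W_34km_nb, style="W"))
system.time(fa00.moran <- moran.test(rawd$fa00, lw))
system.time(fit.lag <- lagsarlm(fa00 ~ d2roadf + slo + dem, data=rawd, lw,
  method="Matrix"))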
Roger
Hopefully it won't colour any answers I receive, but I am primarily a
Stata user. With 64-bit multi-processor Stata, using our own update of
Stata's spatwmat program (updated to take advantage of Mata, the matrix
programming language that was unavailable when spatwmat was first
written), the sample of 9000 takes about 2 hours to form the
spatial weights, do the Moran and LM tests on OLS and run the spatial
lag and spatial error models. So rather faster than we are managing in R
(but probably due to our novice R programming, so I am certainly not
complaining about R here). But Stata is not a candidate for running the
analysis on the 128-CPU cluster because of licensing issues, while
R can be installed on each CPU with no licensing issues. So I need to
know the most memory-efficient and fastest way to form and use a spatial
weights matrix for ca. 150,000 observations.
If people haven't completely lost interest in the question by now, I am
pasting a fragment of the code we are using, noting that typically we
have found the closest command to what we knew how to do in Stata, and
so this may not be the most efficient way in R of forming the weights
matrix and then using it. So I would be grateful for pointers, before I
end up burning through a lot of computing time on the cluster computer
with an inefficient setup.
Thanks,
John Gibson
# qui nspatwmat, name(W_34km_ds) xcoord(x) ycoord(y) band(0 34000) standardize eigenval(E_34km_ds)
coords <- cbind(rawd$x,rawd$y)
W_34km_nb <- dnearneigh(coords, 0, 34000, row.names = NULL, longlat = NULL)
What does:
print(W_34km_nb)
report as the average number of neighbours? Is it too large for your
hypothesised processes?
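The distribution of neighbour counts can also be inspected directly with
card(), which returns the number of neighbours for each observation:
summary(card(W_34km_nb))   # number of neighbours per observation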
# Eigenvalue matrix for the weight matrix - slightly different from Stata results
# E_34km_ds <- eigenw(W_34km_dsls, quiet=NULL)
No, unnecessary - the eigenvalues are created on the fly when needed, and the result is dense.
# Get the Moran stats for -fa00-, call it A
# nspatcorr fa00, bands(0 34000 34000) xcoord(x) ycoord(y) cumul
You may save repeated calls to nb2listw() by doing it once and storing the
listw object for use:
lw <- nb2listw(W_34km_nb,style = "W")
fa00.moran <- moran.test(rawd$fa00, lw)
and so on ...
fa00.moran <- moran.test(rawd$fa00,nb2listw(W_34km_nb,style = "W"))
print(fa00.moran)
# Moran stats for -uhat-, B
No, use lm.morantest() passing the lm object:
uhat.moran <- lm.morantest(rawd.lm, lw)
uhat.moran <- moran.test(rawd.lm$residuals,nb2listw(W_34km_nb,style = "W"))
print(uhat.moran)
# Weight matrix, inverse distance, standardise
temp <- nbdists(W_34km_nb, coords, longlat = NULL)
W_34km_ivds <- lapply(temp, function(x) 1/x)
Why create the dense matrix? nb2listw() with the glist argument gives the
inverse-distance weights in sparse (listw) form directly:
W_34km_dsls <- nb2listw(W_34km_nb, glist=W_34km_ivds, style="W", zero.policy=TRUE)
W_34km_dsmat <- nb2mat(W_34km_nb, glist=W_34km_ivds, style="W", zero.policy=TRUE)
#W_34km_dsmat[1:5,1:5] - Same as the Stata results
W_34km_dsls <- mat2listw(W_34km_dsmat)
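A rough way to see what the dense detour costs (using the objects exactly as
created above):
# storage of the dense n x n matrix versus the sparse neighbour/weight lists
print(object.size(W_34km_dsmat), units="Mb")
print(object.size(W_34km_dsls), units="Mb")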
# Spatial diagnose, weight matrix (W_34km_ds)
rawd.lm.diag <- lm.LMtests(rawd.lm, W_34km_dsls, test=c("LMerr", "LMlag", "RLMerr", "RLMlag"))
print(rawd.lm.diag)
rm(coords,W_34km_nb, temp,W_34km_ivds,W_34km_dsmat)
rm(rawd.lm)
As stated above, don't use method="eigen" for larger N, use
method="Matrix", or another exact sparse method, or an approximation, such
as Monte Carlo or Chebyshev. Running N=25000 on a 1GB laptop isn't a
problem; with sparse matrix techniques or approximations, everything
becomes easier.
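A sketch of the fitting calls (right-hand side abbreviated to a few of your
covariates; the MC and Chebyshev approximations assume weights that are
symmetric or similar to symmetric, which should hold here since the
distance-band neighbours and inverse distances are symmetric before
row-standardisation):
library(spdep)
# exact sparse log-determinant via the Matrix package (Cholesky)
lag_Matrix <- lagsarlm(fa00 ~ d2roadf + slo + dem, data=rawd, W_34km_dsls,
  method="Matrix")
# approximations to the log-determinant, usually faster again for large n
lag_MC   <- lagsarlm(fa00 ~ d2roadf + slo + dem, data=rawd, W_34km_dsls,
  method="MC")
lag_Cheb <- lagsarlm(fa00 ~ d2roadf + slo + dem, data=rawd, W_34km_dsls,
  method="Chebyshev")
# the same method argument applies to the spatial error model
err_Matrix <- errorsarlm(fa00 ~ d2roadf + slo + dem, data=rawd, W_34km_dsls,
  method="Matrix")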
Note also that you need to use impacts() on the output of lagsarlm()
because the coefficients should not be interpreted like OLS coefficients -
see the references in ?impacts.
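A sketch of that step on the fitted lag model (rawd.lag from the code below),
using traces of powers of the weights matrix taken from a sparse
representation, so that no dense inverse is ever formed:
# coerce the listw to a sparse matrix and precompute traces of its powers
W_sp  <- as(W_34km_dsls, "CsparseMatrix")
trMat <- trW(W_sp, type="mult")
# simulation-based summary of direct, indirect and total impacts
summary(impacts(rawd.lag, tr=trMat, R=1000), zstats=TRUE)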
# spatial lag, weights(W_34km_ds), same RHS as the LM
rawd.lag <- lagsarlm(fa00 ~
d2roadf+d2nr_ld51_00f+s_ld10_00_10+protectf+non_adjacent+d2pvcapf+slo+dem+pa+ta+soil_n+soil_ap+soil_ph+s_pop_00_10+s_gdp_00_10,
data=rawd, W_34km_dsls, method="eigen")
--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: roger.biv...@nhh.no
_______________________________________________
R-sig-Geo mailing list
R-sig-Geo@stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-geo