[R-sig-Geo] Forming Spatial Weights Matrix with Large Datasets (>150, 000 observations)

John GIBSON Tue, 16 Nov 2010 01:48:28 -0800

Dear List

I am keen to receive advice from listers on the best strategies in R for 
forming and using a spatial weights matrix for a large (ca. 150,000) dataset. 
Is it fastest and most memory-efficient to do the entire analysis in R, or is 
it better to use GeoDa to form a gal or gwt file and then import that into R 
and then procedue with commands from spdep?


The background is that I am using R.2.12.0 (64 bit) on a 16-Gb dual core 
Windows XP setup to run spatial diagnostics (Moran and LM tests) on OLS 
residuals and estimate spatial error and spatial lag models on an environmental 
dataset, with 150,000 observations (1 km pixels). To start with more manageable 
samples I have grid-sampled pixels at the intersection of every 4th, 8th and 
16th row and column, giving samples of ca. 9000, 2300 and 600. To ensure no 
"islands" when using the smallest samples, the neighbourhood for the weights 
matrix is set as 34 km, and the weights are inverse-distance. In case anyone 
asks "why are you even bothering to do this" it is to examine the impact of the 
spatial sampling, since we have previously based our analysis (of impacts of 
roads on forest cover) on a 1-in-25 sample and now want to see how sensitive 
inferences might be to this sort of spatial sampling.

Back to my current problem: while R breezes through the small samples (less 
than a minute for a sample of 600, about 20 minutes for a sample of 2300) it 
has been running for 3 days now on a sample of 9000 observations, using around 
about 50% of the CPU and about 5.6 Gb of the 16 Gb available on the machine. I 
have access to a cluster computer with 128-CPUs to run the analysis for the 
full population of 150,000 observations, but based on the slow performance with 
9000 on a 16-Gb machine I am not optimistic about the ability of the cluster 
computer to run the data at full 1km scale.

Hopefully it won't colour any answers I receive, but I am primarily a Stata 
user, and with 64-bit multi-processor stata using our own update of stata's 
spatwmat program (updated to take advantage of Mata - the matrix programming 
language that was unavailable when spatwmat was first written) for the sample 
of 9000 it takes about 2 hours to form the spatial weights, do the Moran and LM 
tests on OLS and run the spatial lag and spatial error models. So rather faster 
than we are managing in R (but probably due to our novice R programming, so I 
am certainly not complaining about R here).  But Stata is not a candidate for 
running the analysis on the 128-cluster computer because of licencing issues, 
while R can be installed on each CPU with no licensing issues. So I need to 
know the most memory-efficient and fastest way to form and use a spatial 
weights matrix for ca. 150,000 observations.

If people haven't completely lost interest in the question by now, I am pasting 
a fragment of the code we are using, noting that typically we have found the 
closest command to what we knew how to do in Stata, and so this may not be the 
most efficient way in R of forming the weights matrix and then using it. So I 
would be grateful for pointers, before I end up burning through a lot of 
computing time on the cluster computer with an inefficient set up..

Thanks,

John Gibson


#  qui nspatwmat, name(W_34km_ds) xcoord(x) ycoord(y) band(0 34000) standardize 
eigenval(E_34km_ds)

coords <- cbind(rawd$x,rawd$y)
W_34km_nb <- dnearneigh(coords, 0, 34000, row.names = NULL, longlat = NULL)

# Eigen value matrix for the weight matrix - slightly different from Stata 
results
#E_34km_ds <- eigenw(W_34km_dsls, quiet=NULL)

# Get the Moran stats for -fa00-, call it A
# nspatcorr fa00, bands(0 34000 34000) xcoord(x) ycoord(y) cumul

fa00.moran <- moran.test(rawd$fa00,nb2listw(W_34km_nb,style = "W"))
print(fa00.moran)

# Moran stats for -uhat-, B

uhat.moran <- moran.test(rawd.lm$residuals,nb2listw(W_34km_nb,style = "W"))
print(uhat.moran)

# Weight matrix, inverse distance, standardise
temp <- nbdists(W_34km_nb, coords, longlat = NULL)
W_34km_ivds <-lapply(temp, function(x) 1/x)
W_34km_dsmat <- nb2mat(W_34km_nb, glist=W_34km_ivds, style="W", 
zero.policy=TRUE)

#W_34km_dsmat[1:5,1:5] - Same as the Stata results

W_34km_dsls <- mat2listw(W_34km_dsmat)

# Spatial diagnose, weight matrix (W_34km_ds)
rawd.lm.diag<-lm.LMtests(rawd.lm, W_34km_dsls, test=c("LMerr", "LMlag", 
"RLMerr","RLMlag"))
print(rawd.lm.diag)

rm(coords,W_34km_nb, temp,W_34km_ivds,W_34km_dsmat)
rm(rawd.lm)

# spatial lag, weights(W_34km_ds), same RHS as the LM
  rawd.lag <- lagsarlm(fa00 ~
d2roadf+d2nr_ld51_00f+s_ld10_00_10+protectf+non_adjacent+d2pvcapf+slo+dem+pa+ta+soil_n+soil_ap+soil_ph+s_pop_00_10+s_gdp_00_10,
 data=rawd, W_34km_dsls, method="eigen")

_______________________________________________
R-sig-Geo mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

[R-sig-Geo] Forming Spatial Weights Matrix with Large Datasets (>150, 000 observations)

Reply via email to