On Thu, 7 Nov 2019, Robert R wrote:

Dear Roger,

Many thanks for your help.

I have an additional question:

Is it possible to create a "separate" lw (nb2listw) object (with different row numbers) from my data set? For now, I am taking my data set and merging it with the sf object polygon_nyc via merge(polygon_nyc, listings, by = "zipcode"), so I end up creating a huge n x n matrix (depending on the size of my data set).

Taking polygon_nyc alone and turning it into a lw (weights list) object gives only n = 177.

Of course running

spatialreg::lagsarlm(formula = model, data = listings_sample, listw = polygon_nyc_lw, tol.solve = 1.0e-10)

does not work ("Input data and weights have different dimensions").

Is the only option to take my data set, merge it with polygon_nyc (by zipcode) and then create the weights list lw? Or is there another option?

I think we are getting more clarity. You do not know the location of the lettings beyond their zipcode. You do know the boundaries of the zipcode areas, and can create a neighbour object from these boundaries. You then want to treat all the lettings in a zipcode area i as neighbours, and additionally lettings in zipcode areas neighbouring i as neighbours of lettings in i. This is the data structure that motivated the spdep::nb2blocknb() function:

https://r-spatial.github.io/spdep/reference/nb2blocknb.html

Try running the examples to get a feel for what is going on.
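Applied to your objects, a minimal sketch might look like the following (untested; it assumes the polygon_nyc and listings objects from your earlier code, with every listings$zipcode value occurring in polygon_nyc$zipcode):

library(spdep)

# zipcode-level queen contiguity, with the zipcodes as region ids
zip_nb <- poly2nb(polygon_nyc, queen = TRUE, row.names = polygon_nyc$zipcode)

# listing-level neighbours: every listing in the same zipcode, and every listing
# in a contiguous zipcode, becomes a neighbour of each listing
listings_nb <- nb2blocknb(zip_nb, ID = as.character(listings$zipcode))

# row-standardised weights for use with spatialreg/splm
listings_lw <- nb2listw(listings_nb, style = "W", zero.policy = TRUE)

Because listings_lw is built from the same listings object (in the same row order) that you pass to lagsarlm(), the "different dimensions" error should go away.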

I feel that most of the variability will vanish in the very large numbers of neighbours, over-smoothing the outcomes. If you do not have locations for the lettings themselves, I don't think you can make much progress.

You could try a linear mixed model (or a GAM with a spatially structured random effect) with a temporal and a spatial random effect. See the HSAR package, articles by Dong et al., and maybe https://doi.org/10.1016/j.spasta.2017.01.002 for another survey; neither this nor Dong et al. handles spatio-temporal settings. MRF spatial random effects at the zipcode level might be a way forward, together with an IID random effect at the same level (equivalent to self-neighbours).
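As a rough sketch of the MRF-plus-IID idea (not part of the original exchange; fitted here with mgcv's gam rather than HSAR, reusing zip_nb from the sketch above; log_price, bedrooms and date_compiled are placeholder names loosely based on your code):

library(mgcv)

# named neighbour list at zipcode level; mgcv matches the names
# against the levels of the zipcode factor
nb_list <- lapply(zip_nb, function(i) i[i > 0])   # drop spdep's 0 code for "no neighbours"
names(nb_list) <- attr(zip_nb, "region.id")

listings$zipcode <- factor(listings$zipcode, levels = names(nb_list))

# structured (MRF) + unstructured (IID) zipcode effects; zipcodes with no
# listings may need to be dropped from both nb_list and the factor levels
fit <- gam(log_price ~ bedrooms + factor(date_compiled) +
             s(zipcode, bs = "mrf", xt = list(nb = nb_list)) +
             s(zipcode, bs = "re"),
           data = listings, method = "REML")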

Hope this helps,

Roger


Best regards,
Robert

________________________________________
From: Roger Bivand <roger.biv...@nhh.no>
Sent: Wednesday, November 6, 2019 15:07
To: Robert R
Cc: r-sig-geo@r-project.org
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method

On Tue, 5 Nov 2019, Robert R wrote:

Dear Roger,

Thank you for your reply. I disabled HTML; my e-mails should now be in
plain text.

I will give a better context for my desired outcome.

I am taking Airbnb's listings information for New York City available
on: http://insideairbnb.com/get-the-data.html

I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
- in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
Listings/ folder. When importing all these 54 files into one single data
set, I create a new "date_compiled" variable/column.

In total, after the data cleansing process, I have a little more 2
million observations.

You have repeat lettings for some, but not all properties. So this is at
best a very unbalanced panel. For those properties with repeats, you may
see temporal movement (trend/seasonal).

I suggest (strongly) taking a single borough or even a single zipcode with some
hundreds of properties, and working from there. Do not include the
observation as its own neighbour; perhaps identify repeats and handle them
specially (create or use a property ID). Unbalanced panels may also create
a selection-bias issue (why are some properties only listed sometimes?).
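Something along these lines might be a starting point (a sketch only; it assumes the listings data frame from your code below, with its id and zipcode columns, and that "Manhattan" is a value in polygon_nyc$borough):

library(dplyr)

# restrict to one borough to keep the weights manageable
zips_sub <- polygon_nyc$zipcode[polygon_nyc$borough == "Manhattan"]
listings_sample <- listings %>% filter(zipcode %in% zips_sub)

# how many scrape dates does each property appear in?
repeats <- listings_sample %>% count(id, name = "n_periods")
table(repeats$n_periods)   # 1 = appears once, up to 54 = full panel

# carry the count along as an explicit property-level variable
listings_sample <- listings_sample %>% left_join(repeats, by = "id")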

So although promising, this isn't simple, and getting to a hedonic model
may be hard, but not (just) because of spatial autocorrelation. I wouldn't
necessarily trust OLS output either, partly because of the repeat-property
issue.

Roger


I created 54 timedummy variables for each time period available.

I want to estimate, using a hedonic spatial time-dummy model, the impact of
a variety of characteristics which potentially determine the daily rate
of Airbnb listings through time in New York City (e.g. characteristics
of the listing such as number of bedrooms, whether the host is professional,
proximity to downtown (New York City Hall) and to the nearest subway station,
income per capita, etc.).

My dependent variable is price (log price, common in the related
literature for hedonic prices).

The OLS model is done.

For the spatial model, I am assuming that hosts, when deciding the
pricing of their listings, take into account not only their structural and
location characteristics, but also the prices charged by nearby listings
with similar characteristics - spatial autocorrelation is then present,
or at least spatial dependence is present in the dependent variable.

As I wrote in my previous post, I was willing to consider the observation
itself as a neighbour.

Parts of my code can be found below:

########

## packages

packages_install <- function(packages) {
  new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
  if (length(new.packages))
    install.packages(new.packages, dependencies = TRUE)
  sapply(packages, require, character.only = TRUE)
}

packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate",
                       "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
packages_install(packages_required)

# Working directory
setwd("C:/Users/User/R")



## shapefile_us

# Shapefile zips import and Coordinate Reference System (CRS) transformation
# Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")

# Columns removal
shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))

# Column rename: ZCTA5CE10
setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))

# Column class change: zipcode
shapefile_us$zipcode <- as.character(shapefile_us$zipcode)



## polygon_nyc

# Zip code not available in shapefile: 11695
polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)



## weight_matrix

# Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen = TRUE)

# Include the neighbour itself as a neighbour
# for (i in 1:length(polygon_nyc_nb)) { polygon_nyc_nb[[i]] <- as.integer(c(i, polygon_nyc_nb[[i]])) }
polygon_nyc_nb <- include.self(polygon_nyc_nb)

# Weights to each neighboring polygon
lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)



## listings

# Data import
files <- list.files(path = "Listings/", pattern = "\\.csv$", full.names = TRUE)
listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding = "UTF-8")), files)
listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
listings <- listings %>% bind_rows()

# Characters removal
listings$date_compiled <- gsub("Listings/", "", listings$date_compiled)
listings$date_compiled <- gsub(".csv", "", listings$date_compiled, fixed = TRUE)  # fixed = TRUE: "." is a regex metacharacter
listings$price <- gsub("\\$", "", listings$price)
listings$price <- gsub(",", "", listings$price)



## timedummy

timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep = "")
timedummy <- paste(timedummy, sep = "", collapse = " + ")
timedummy <- gsub("-", "_", timedummy)



## OLS regression

# Pooled cross-section data - randomly sampled cross-sections of Airbnb listings prices at different points in time
regression <- plm(formula = as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")),
                  data = listings, model = "pooling", index = "id")

########

Some of my ids repeat in multiple time periods.

I use NYC's zip codes to left-join my data with zip-code-specific neighborhood
characteristics, such as income per capita for that zip code, etc.

Now I want to apply the hedonic model with the timedummy variables.

Do you know how to proceed? 1) Which package should I use (spdep or splm)?
2) Do I have to join polygon_nyc (by zip code) to my listings data set, and
then calculate the weights list "lw"?

Again, thank you very much for the help provided until now.

Best regards,
Robert

________________________________________
From: Roger Bivand <roger.biv...@nhh.no>
Sent: Tuesday, November 5, 2019 15:30
To: Robert R
Cc: r-sig-geo@r-project.org
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method

On Tue, 5 Nov 2019, Robert R wrote:

I have a large pooled cross-section data set. I would like to
estimate/regress using spatial autocorrelation methods. I am assuming
for now that spatial dependence is present in both the dependent
variable and the error term.

My data set covers a period of over 4 years of monthly data (54 periods).
To this end, I have created a time dummy variable for each time period.
I also created a weights matrix using the functions "poly2nb" and
"nb2listw". Now I am trying to figure out a way to estimate my model,
which involves a really big data set. Basically, my model is as follows:

y = γD + ρW1y + Xβ + λW2u + ε

My questions are:

1) My spatial weights matrix for the whole data set will probably be an
enormous matrix with submatrices for each time period. I don't think it
would be possible to calculate this. What I would like to know is a way to
estimate each time dummy/period separately (to compare different periods
alone). How can I do this?

2) Which package should I use: spdep or splm?

Thank you and best regards,
Robert

Please do not post HTML, only plain text. Almost certainly your model
specification is wrong (SARAR/SAC is always a bad idea if alternatives are
untried). What is your cross-sectional size? Using sparse Kronecker
products, the "enormous" matrix may not be very big. Does it make any
sense to use time dummies (54 x N x T will be mostly zero anyway)? Are most
of the covariates time-varying? Please provide motivation and use area,
preferably with affiliation (your email and user name are not
informative) - this feels like a real estate problem, probably wrongly
specified. You should use splm if time makes sense in your case, but if it
really doesn't, simplify your approach, as much of the data will be
subject to very large temporal autocorrelation.
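As a sketch of the sparse Kronecker point (an illustration only, assuming a balanced panel stacked period by period, with lw the cross-sectional weights list from your code and 54 periods):

library(spdep)
library(Matrix)

W <- as(lw, "CsparseMatrix")                    # N x N sparse weights from the listw object
T_periods <- 54
W_panel <- kronecker(Diagonal(T_periods), W)    # (N*T) x (N*T), block diagonal, still sparse
format(object.size(W_panel), units = "Mb")      # modest, unlike a dense N*T square matrix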

If this is a continuation of your previous question about using
self-neighbours, be aware that you should not use self-neighbours in
modelling; they are only useful for the Getis-Ord local G_i^* measure.
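For completeness, the kind of use where self-neighbours do belong (an illustration only; polygon_nyc is assumed to be your zipcode polygons, and zip_mean_price a hypothetical numeric vector, e.g. mean log price per zipcode, in the same order):

library(spdep)

# self-included neighbours give the G_i^* (star) variant in localG()
nb_zip <- poly2nb(polygon_nyc, queen = TRUE)
lw_star <- nb2listw(include.self(nb_zip), style = "W", zero.policy = TRUE)
Gi_star <- localG(zip_mean_price, lw_star, zero.policy = TRUE)   # one z-value per zipcode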

Roger


--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: roger.biv...@nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
_______________________________________________
R-sig-Geo mailing list
R-sig-Geo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
