Looking at the plots you sent, I can't tell whether you have **spatial** structure in your model residuals. You definitely have heteroskedasticity and huge residuals at the high end of the fitted values, but neither is necessarily due to spatial structure.
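If you want to check that formally, a global Moran's I test on the residuals is a quick option. A minimal sketch with spdep -- here I simulate a stand-in for the x/y/residuals data frame you export as a raster (the column names `x`, `y`, `rsds` are my assumption, swap in whatever yours are called):

```r
library(spdep)

# simulated stand-in for your coordinates + residuals data frame
set.seed(42)
resid_df <- data.frame(x = runif(200), y = runif(200), rsds = rnorm(200))

# neighbours from the k = 8 nearest points, row-standardised weights;
# k is an arbitrary choice here
nb <- knn2nb(knearneigh(as.matrix(resid_df[, c("x", "y")]), k = 8))
lw <- nb2listw(nb, style = "W")

# spatially structured residuals would show a significant positive
# Moran's I; pure noise should not
moran.test(resid_df$rsds, lw)
```

That gives you a single number to track instead of eyeballing the residual map, though it's sensitive to how you define the neighbourhood.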
My guess would be that the core cause is that you're using an interpolation-ish method, random forests, and training it on data that effectively stops at ~75, judging by your histogram. Regular random forests will only predict within the range of values they were trained on, and their predictions are weighted towards the mean outcome value in the training set. The random forest then uses your input variables to predict values further from that mean, but it can only identify relationships between predictor levels and outcomes that exist in the training data. Given the high number of low NTL observations in your data, the random forest is able to build a decent model for the lower end of NTL; given that there are very few high NTL observations, the random forest mostly doesn't bother trying to predict them.

You could attempt to add more predictors, like land cover classification, or to assign your high NTL observations higher case weights, in order to make the model more incentivized and more able to predict high NTL values. But given the sheer rarity of these values, doing so will probably get you a model that's confidently wrong in predicting high values, with abrupt jumps in predicted NTL values across the map. I think the main way to improve this model would be to add more high NTL observations to the training set.

As an aside, perhaps my favorite part of Edzer and Roger's new book is their discussion of spatial autocorrelation (https://r-spatial.org/book/15-Measures.html). The quote I use is "the important point [is] that spatial autocorrelation is not inherent in spatial phenomena, but often, is engendered by inappropriate entitation, by omitted variables and/or inappropriate functional form."
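If you do want to try the case-weights route, tidymodels supports that directly these days. A rough sketch on simulated data (I haven't run this against your data; the 0.9 quantile cutoff and the weight of 5 are arbitrary choices, not recommendations, and I'm assuming the ranger engine accepts case weights and passes them through to ranger's case.weights):

```r
library(tidymodels)

# simulated stand-in for your training set, with a right-skewed outcome
set.seed(123)
fake_train <- tibble(
  ntl = rexp(500, rate = 0.1),
  x1 = rnorm(500),
  x2 = rnorm(500)
)

# up-weight the rare high-outcome rows; cutoff and weight are arbitrary
cutoff <- quantile(fake_train$ntl, 0.9)
fake_train <- fake_train %>%
  mutate(wts = hardhat::importance_weights(ifelse(ntl > cutoff, 5, 1)))

rf_spec <- rand_forest(trees = 500) %>%
  set_mode("regression") %>%
  set_engine("ranger")

# recipes assigns the case-weights role to `wts` automatically based on
# its class; add_case_weights() tells the workflow to actually use it
wf <- workflow() %>%
  add_recipe(recipe(ntl ~ ., data = fake_train)) %>%
  add_model(rf_spec) %>%
  add_case_weights(wts)

fit(wf, fake_train)
```

Whether this helps or just produces the confidently-wrong jumps I mentioned is an empirical question for your data.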
My guess is that you're going to be best served here by focusing on the model, and not the spatial side of things -- any spatial structure you do see in your residuals can likely be addressed by reconsidering what variables you're using in your model.

Thanks,
Michael Mahoney

781-812-8842 | mike.mahoney....@gmail.com
Website | LinkedIn | GitHub

On Sun, Oct 1, 2023 at 5:21 AM Nikolaos Tziokas <nikos.tzio...@gmail.com> wrote:
>
> I have a response variable, nighttime light luminosity (called *ntl* in
> the data.frame I am using). As you can see in the image attached (link
> below), it has some bright spots (highlighted in red), which are areas
> with high brightness (outliers in the data.frame). I also shared the
> histogram of the response. It's clear that the distribution is
> right-skewed (and possibly there is bimodality?).
>
> *Purpose of my analysis*
> My goal is to predict *ntl* from the coarse spatial resolution to a
> higher one. This means that I need to maintain these bright spots (the
> outliers) in the predicted image. Because at a later stage I will
> downscale the residuals of the regression using area-to-point Kriging,
> I need the residuals to be random (no spatial structure).
>
> *Analysis*
> Following this <https://juliasilge.com/blog/ikea-prices/> approach, the
> resulting RF residuals clearly show a spatial structure (i.e., the
> residuals do not show a random pattern); in particular, it seems that
> the model cannot capture the areas with high luminosity. This can also
> be seen if I plot the fitted vs. residuals (please see the GDrive link
> below). Again, for a "good" model I would like to see a symmetric
> scatter of points around the horizontal line at zero, indicating random
> deviations of predictions from the observed values, but from the plot
> above I see that this isn't the case.
>
> My thought is that in the NTL image one can see where the bright areas
> are and where they are not, while in the covariates there is no such
> "variation" in these areas. This means that the RFR model I use cannot
> capture this variability (I'm just thinking out loud).
>
> What are your thoughts on that? How can I improve the model in order to
> eliminate the spatial structure in the residuals while maintaining the
> bright areas in the predicted high-resolution image? Here is the
> complete code:
>
> wd <- "path/"
>
> # this is a projected CRS
> provoliko <- "EPSG:24313"
>
> # the data.frame of the variables
> df <- read.csv(paste0(wd, 'block.data.csv'))
>
> # here I extract the coordinates
> crds <- df[, 1:2]
>
> # here I remove the coordinates from the data.frame
> df <- df[, 3:12]
>
> # random forest regression
> library(tidymodels)
> set.seed(456)
>
> # splitting the data.frame into training and test sets using
> # stratified random sampling
> ikea_split <- initial_split(df, strata = ntl, prop = 0.80)
> ikea_train <- training(ikea_split)
> ikea_test <- testing(ikea_split)
>
> set.seed(567)
> ikea_folds <- bootstraps(ikea_train, strata = ntl)
> ikea_folds
>
> library(usemodels)
> eq1 <- ntl ~ .
> use_ranger(eq1, data = ikea_train)
>
> library(textrecipes)
> ranger_recipe <-
>   recipe(formula = eq1, data = ikea_train)
>
> ranger_spec <-
>   rand_forest(mtry = tune(), min_n = tune(), trees = 1001) %>%
>   set_mode("regression") %>%
>   set_engine("ranger")
>
> ranger_workflow <-
>   workflow() %>%
>   add_recipe(ranger_recipe) %>%
>   add_model(ranger_spec)
>
> set.seed(678)
> doParallel::registerDoParallel()
> ranger_tune <-
>   tune_grid(ranger_workflow,
>     resamples = ikea_folds,
>     grid = 10
>   )
>
> show_best(ranger_tune, metric = "rsq")
>
> autoplot(ranger_tune)
>
> final_rf <- ranger_workflow %>%
>   finalize_workflow(select_best(ranger_tune, metric = "rsq"))
>
> final_rf
>
> ikea_fit <- last_fit(final_rf, ikea_split)
> ikea_fit
>
> collect_metrics(ikea_fit)
>
> collect_predictions(ikea_fit) %>%
>   ggplot(aes(ntl, .pred)) +
>   geom_abline(lty = 2, color = "gray50") +
>   geom_point(alpha = 0.5, color = "midnightblue") +
>   coord_fixed()
>
> predict(ikea_fit$.workflow[[1]], ikea_test[15, ])
>
> ikea_all <- predict(ikea_fit$.workflow[[1]], df[, 2:11])
> ikea_all <- as.data.frame(cbind(df$ntl, ikea_all))
> ikea_all$rsds <- ikea_all$`df$ntl` - ikea_all$.pred
> ikea_all <- as.data.frame(cbind(crds, ikea_all))
> ikea_all <- ikea_all[, -3:-4]
>
> # export the residuals as a raster
> library(terra)
>
> rsds <- rast(ikea_all, type = "xyz")
> crs(rsds) <- provoliko
> writeRaster(rsds, paste0(wd, "rsds.tif"), overwrite = TRUE)
>
> library(vip)
>
> imp_spec <- ranger_spec %>%
>   finalize_model(select_best(ranger_tune, metric = "rsq")) %>%
>   set_engine("ranger", importance = "permutation")
>
> workflow() %>%
>   add_recipe(ranger_recipe) %>%
>   add_model(imp_spec) %>%
>   fit(ikea_train) %>%
>   extract_fit_parsnip() %>%
>   vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))
>
> Because the csv I'm using has several thousands of rows, I can share it
> via a link, from here
> <https://drive.google.com/drive/folders/1V115zpdU2-5fXssI6iWv_F6aNu4E5qA7?usp=sharing>
> (also in the link you can see the spatial structure of the residuals
> and the histogram of the response variable). Just so you know, running
> and tuning a model takes ~2-3 minutes on my laptop (8 GB of RAM, 4-core
> Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz). I am using R 4.3.1 and
> RStudio 2023.09.0 Build 463, on Windows 11.
>
> --
> Tziokas Nikolaos
> Cartographer
>
> Tel: (+44)07561120302
> LinkedIn <http://linkedin.com/in/nikolaos-tziokas-896081130>
>
> _______________________________________________
> R-sig-Geo mailing list
> R-sig-Geo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo