I have a response variable which is the nighttime light luminosity (called *ntl* in the data.frame I am using). As you can see in the image attached (link below), it has some bright spots (highlighted in red) which are areas with high brightness (outliers in the data.frame). Also, I shared the histogram of the response. It's clear that the distribution is right-skewed (and possibly there is bimodality?).
*Purpose of my analysis* My goal is to predict the *ntl* from the coarse spatial resolution to a higher. This means that, I need to maintain these bright spots (the outliers) in the predicted image. Because at a later stage, I will downscale the residuals of the regression using area-to-point Kriging, I need the residuals to be random (no spatial structure). *Analysis* Following this <https://juliasilge.com/blog/ikea-prices/>approach, the resulting RF residuals show clearly a spatial structure (i.e., the residuals do not show a random pattern), especially it seems that the model cannot capture the areas with high luminosity. This can also be seen if I plot the fitted vs residuals (please see the GDrive link below). Again, for a "good" model I would like to see a symmetric scatter of points around the horizontal like at zero, indicating random deviations of predictions from the observed values, but from the plot above I see that this isn't the case. My thought is that in the NTL image one can see where the bright areas are and were not, while in the covariates there is no such "variation" in these areas. This means that, the RFR model I use cannot capture this variability (I'm just thinking out loud). What are your thoughts on that? How can I improve the model in order to eliminate the spatial structure in residuals while mainting the bright areas in the predicted high resolution image? Here is the complete code: wd <- "path/" # this a projected CRS provoliko <- "EPSG:24313" # the data.frame of the variables df <- read.csv(paste0(wd, 'block.data.csv')) # here I extract the coordinates crds <- df[, 1:2] # here I remove the coordinates from the data.frame df <- df[, 3:12] # random forest regression library(tidymodels) set.seed(456) # splitting the data.frame into training and test set using stratified random sampling ikea_split <- initial_split(df, strata = ntl, prop = 0.80) ikea_train <- training(ikea_split) ikea_test <- testing(ikea_split) set.seed(567) ikea_folds <- bootstraps(ikea_train, strata = ntl) ikea_folds library(usemodels) eq1 <- ntl ~ . use_ranger(eq1, data = ikea_train) library(textrecipes) ranger_recipe <- recipe(formula = eq1, data = ikea_train) ranger_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1001) %>% set_mode("regression") %>% set_engine("ranger") ranger_workflow <- workflow() %>% add_recipe(ranger_recipe) %>% add_model(ranger_spec) set.seed(678) doParallel::registerDoParallel() ranger_tune <- tune_grid(ranger_workflow, resamples = ikea_folds, grid = 10 ) show_best(ranger_tune, metric = "rsq") autoplot(ranger_tune) final_rf <- ranger_workflow %>% finalize_workflow(select_best(ranger_tune, metric = "rsq")) final_rf ikea_fit <- last_fit(final_rf, ikea_split) ikea_fit collect_metrics(ikea_fit) collect_predictions(ikea_fit) %>% ggplot(aes(ntl, .pred)) + geom_abline(lty = 2, color = "gray50") + geom_point(alpha = 0.5, color = "midnightblue") + coord_fixed() predict(ikea_fit$.workflow[[1]], ikea_test[15, ]) ikea_all <- predict(ikea_fit$.workflow[[1]], df[, 2:11]) ikea_all <- as.data.frame(cbind(df$ntl, ikea_all)) ikea_all$rsds <- ikea_all$`df$ntl` - ikea_all$.pred ikea_all <- as.data.frame(cbind(crds, ikea_all)) ikea_all <- ikea_all[,-3:-4] # export the residuals as a raster library(terra) rsds <- rast(ikea_all, type = "xyz") crs(rsds) <- provoliko writeRaster(rsds, paste0(wd, "rsds.tif"), overwrite = TRUE) library(vip) imp_spec <- ranger_spec %>% finalize_model(select_best(ranger_tune, metric = "rsq")) %>% set_engine("ranger", importance = "permutation") workflow() %>% add_recipe(ranger_recipe) %>% add_model(imp_spec) %>% fit(ikea_train) %>% extract_fit_parsnip() %>% vip(aesthetics = list(alpha = 0.8, fill = "midnightblue")) Because the csv I'm using has several thousands of rows, I can share it via a link, from here <https://drive.google.com/drive/folders/1V115zpdU2-5fXssI6iWv_F6aNu4E5qA7?usp=sharing> (also in the link you can see the spatial structure of residuals and histogram of the response variable). Just so you know, running and tuning a model takes ~ 2-3 minutes on my laptop (8 gigs of RAM, 4 cores processor (CPU: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz)). i am using R 4.3.1 and RStudio 2023.09.0 Build 463, Windows 11. -- Tziokas Nikolaos Cartographer Tel:(+44)07561120302 LinkedIn <http://linkedin.com/in/nikolaos-tziokas-896081130> [[alternative HTML version deleted]] _______________________________________________ R-sig-Geo mailing list R-sig-Geo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-geo