Hi MADlib Developers,
Following Ivan's and Frank's suggestion, I would like to propose a
description and interface for geographically weighted regression (GWR). PostGIS
functions will be invoked to compute distances in a given CRS and to extract the
rectangle coordinates of the study area. If MADlib cannot access PostGIS
routines, we can only implement a few simple GIS utilities in our own code.
GWR models the local relationship between a numerical dependent variable and
one or more explanatory (independent) variables, producing a model of spatially
varying relationships. It has been widely used to understand the spatial
patterns of natural and social phenomena.
GWR constructs a separate local equation for each location in the table,
incorporating the dependent and independent variables that fall within the
bandwidth of each target geometry. The shape and extent of the bandwidth
depend on the spatial kernel type (gauss, exp or bisquare) and on either the
distance (fixed methods) or the number of neighbors (adaptive methods).
Because one equation is fitted per location, the computational burden of GWR
grows with the number of prediction locations, so a parallelized GWR is
necessary in high-performance environments such as GPDB.
There are two important caveats about GWR. First, GWR can estimate
coefficients at any location but can only provide diagnostic information at
observation locations. Second, according to Páez et al. (2011), basic GWR is
not an appropriate method for small sample sizes (<160). Many advanced
geographically weighted methods have been proposed (see Wheeler DC 2009,
Brunsdon C et al. 2012, Gollini I et al. 2015), and I plan to implement them
in the future. The description of the interface and functions for GWR is
provided below. Coefficient columns in the output are kept separate so that
results can be mapped easily in GIS. Could you kindly take a look and give me
advice or feedback to improve it? Many thanks!
Best,
ChenLiang Wang
--------------------------------------------------------------------------------------------------------------------------------------
Description of Geographically Weighted Regression (Spatial
Statistics -> Regression Models)
Training Function
The geographically weighted regression training function has the following
syntax.
gwregr_train(source_table,
out_table,
dependent_varname,
independent_varname,
kernel_params,
adaptive_option,
ftest_option,
regression_location,
prediction_location,
grouping_cols,
verbose
)
-----------------------------------------------------------------------------------------------------------------------------------
Arguments
source_table
TEXT. The name of the table containing the training data.
out_table
TEXT. Name of the generated table containing the output model.
The output table contains the following columns.
<...> Any grouping columns provided during training. Present only if
the grouping option is used.
coef_<independent_varname1>, coef_<independent_varname2> ... FLOAT8[].
One column per independent variable, each holding the vector of regression
coefficients for that variable in each location.
r2 FLOAT8. R-squared coefficient of determination of the model.
adjr2 FLOAT8. Adjusted-R-squared coefficient of determination of the
model.
local_cond_no FLOAT8[]. The local condition number of GWR in each
location (see Wheeler D 2007). Values above 30 indicate that results are
unstable due to local multicollinearity.
F1_stats FLOAT8[]. The F-test array {F-statistic, numerator DF,
denominator DF, p_value} for comparing Ordinary Linear Regression (OLR) and
GWR models (see Leung et al. 2000).
F2_stats FLOAT8[]. The F-test array {F-statistic, numerator DF,
denominator DF, p_value} for comparing Ordinary Linear Regression (OLR) and
GWR models (see Leung et al. 2000).
F3_stats FLOAT8[]. The spatial stationarity test statistic for GWR
coefficients (see Leung et al. 2000).
F3_ndf FLOAT8[]. The spatial stationarity test numerator DF for GWR
coefficients (see Leung et al. 2000).
F3_ddf FLOAT8[]. The spatial stationarity test denominator DF for GWR
coefficients (see Leung et al. 2000).
F3_pv FLOAT8[]. The spatial stationarity test p_value for GWR coefficients
(see Leung et al. 2000).
F4_stats FLOAT8[]. The F-test array {F-statistic, numerator DF,
denominator DF, p_value} for comparing Ordinary Linear Regression (OLR) and
GWR models (see GWR book p92).
num_missing_rows_skipped INTEGER. The number of rows that have NULL
values in the dependent and independent variables, and were skipped in the
computation for each group.
A summary table named <out_table>_summary is created together with the
output table. It has the following columns:
source_table The data source table name
out_table The output table name
dependent_varname The dependent variable
independent_varname The independent variables
num_rows_processed The total number of rows that were used in the
computation.
num_missing_rows_skipped The total number of rows that were skipped
because of NULL values in them.
kernel_function The spatial kernel function
bandwidth The bandwidth parameter
adaptive_option Boolean indicating whether an adaptive kernel function
is used.
dependent_varname
TEXT. Expression to evaluate for the dependent variable.
independent_varname
TEXT. Expression list to evaluate for the independent variables. An
intercept variable is not assumed. It is common to provide an explicit
intercept term by including a single constant 1 term in the independent
variable list.
kernel_params (optional)
TEXT, default: 'kernel=gauss,bw=CV'. Parameters for the kernel function.
The kernel parameter is the name of the kernel function to use:
‘gauss’: wgt = exp(-.5*(vdist/bw)^2);
‘exp’: wgt = exp(-vdist/bw);
‘bisquare’: wgt = (1-(vdist/bw)^2)^2 if vdist < bw, wgt=0 otherwise;
where wgt is the weight, vdist is the vector of distances, and bw is the
bandwidth.
The bw parameter can be set to either CV or AICc when you are not sure what to
use for the distance or number-of-neighbors value. A numerical value can also
be specified for bw. If bw is large enough (above 1e7, for example), the
coefficient estimates of GWR approach the global estimates of ordinary linear
regression.
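As a rough illustration of the three kernels and of how they enter a local
fit, here is a minimal Python sketch using only the standard library. It is
not proposed implementation code: the names kernel_weight, solve and
local_coefficients are hypothetical, distances are assumed to be plain
Euclidean distances between 2D points (the real module would take them from
PostGIS in some CRS), and the local estimate is assumed to be the usual
weighted least squares solution of (X'WX) beta = X'Wy.

```python
import math

def kernel_weight(kernel, vdist, bw):
    """Weight of one observation at distance vdist for bandwidth bw."""
    u = vdist / bw
    if kernel == "gauss":
        return math.exp(-0.5 * u * u)
    if kernel == "exp":
        return math.exp(-u)
    if kernel == "bisquare":
        return (1.0 - u * u) ** 2 if vdist < bw else 0.0
    raise ValueError("unknown kernel: " + kernel)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def local_coefficients(points, X, y, target, kernel, bw):
    """Weighted least squares at one target location (assumed Euclidean)."""
    w = [kernel_weight(kernel, math.dist(p, target), bw) for p in points]
    k = len(X[0])
    xtwx = [[sum(wi * xi[a] * xi[b] for wi, xi in zip(w, X))
             for b in range(k)] for a in range(k)]
    xtwy = [sum(wi * xi[a] * yi for wi, xi, yi in zip(w, X, y))
            for a in range(k)]
    return solve(xtwx, xtwy)

if __name__ == "__main__":
    # Five observation points on a line; X carries an explicit intercept 1.
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
    X = [[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 3.0], [1.0, 5.0]]
    y = [1.0, 2.5, 3.0, 4.5, 7.0]
    print(local_coefficients(pts, X, y, (0.0, 0.0), "gauss", 2.0))
    print(local_coefficients(pts, X, y, (0.0, 0.0), "gauss", 1e7))
```

With bw=1e7 every weight is essentially 1, so the second call returns
practically the same coefficients at any target location, which is the
"global estimation" case mentioned above.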
adaptive_option (optional)
BOOLEAN, default: FALSE. When TRUE, an adaptive kernel is calculated, where
the bandwidth corresponds to the number of nearest neighbours (i.e. adaptive
distance).
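One way to read the adaptive option is that bw is a neighbour count k, and
the distance bandwidth at each target becomes the distance to its k-th
nearest observation. The Python sketch below illustrates that reading only;
adaptive_bandwidth is a hypothetical helper, distances are assumed Euclidean,
and whether the target point itself counts as a neighbour is an assumption
(here it does when it is one of the observations).

```python
import math

def adaptive_bandwidth(points, target, k):
    """Distance from target to its k-th nearest observation (k >= 1)."""
    dists = sorted(math.dist(p, target) for p in points)
    if not 1 <= k <= len(dists):
        raise ValueError("k must be between 1 and the number of observations")
    return dists[k - 1]

if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
    # The same k gives a different distance bandwidth at each target.
    for target in pts:
        print(target, adaptive_bandwidth(pts, target, 3))
```

The resulting distance would then be passed as bw to the chosen kernel, so
dense areas get a small bandwidth and sparse areas a large one.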
ftest_option (optional)
BOOLEAN, default: FALSE. When TRUE, three F-tests and a spatial stationarity
test of the coefficients are also conducted and returned with the results,
according to Leung et al. (2000).
regression_location
2D Point or Polygon Geometry. A geometry (usually 2D point geometry)
representing the locations where training should be conducted. The number of
geometries in regression_location must equal the number of rows in
source_table. In most cases, it is a geometry field of source_table.
prediction_location (optional)
2D Point or Polygon Geometry, default: regression_location. A geometry
(usually 2D point geometry) representing the locations where coefficient
estimates should be computed.
grouping_cols (optional)
TEXT, default: NULL. An expression list used to group the input dataset
into discrete groups, running one regression per group. Similar to the SQL
GROUP BY clause. When this value is null, no grouping is used and a single
result model is generated.
verbose (optional)
BOOLEAN, default: FALSE. Provides verbose output of the results of training.
---------------------------------------------------------------------------------------------------------------------------------------------
Prediction Function
gwregr_predict(coef, col_ind,newdata_table)
Arguments
coef
FLOAT8[][]. Array of the regression coefficients, one vector per location.
col_ind
FLOAT8[]. An array containing the independent variable column names.
newdata_table (optional)
TEXT, default: NULL. The name of the table that provides new data at the
prediction locations. If prediction_location is the same as
regression_location (the default) in the training function, this parameter
can be omitted. Otherwise, newdata_table is required and must provide the
independent variables at the prediction locations, with field names identical
to those in source_table.
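To make the intended computation concrete, the following Python sketch shows
one possible reading of prediction: each location has its own coefficient
vector, and the predicted value is the dot product of that vector with the
independent variables at the same location. The function name gwr_predict
and the argument layout (one list of values per location) are assumptions
for illustration, not the proposed SQL interface.

```python
def gwr_predict(coef, x_rows):
    """Predict one value per location from local coefficients.

    coef   -- list of coefficient vectors, one per location
    x_rows -- list of independent-variable vectors, one per location,
              in the same order and with the same length as coef
    """
    if len(coef) != len(x_rows):
        raise ValueError("coef and x_rows must cover the same locations")
    predictions = []
    for beta, row in zip(coef, x_rows):
        if len(beta) != len(row):
            raise ValueError("coefficient and variable counts differ")
        predictions.append(sum(b * x for b, x in zip(beta, row)))
    return predictions

if __name__ == "__main__":
    # Two locations, each with an intercept term and one variable.
    coef = [[1.0, 2.0], [0.5, 3.0]]
    x_rows = [[1.0, 4.0], [1.0, 2.0]]
    print(gwr_predict(coef, x_rows))
```

Unlike global linear regression, the coefficient vector changes from row to
row, which is why the new data must line up with the prediction locations.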
> Date: Fri, 18 Dec 2015 09:18:22 -0800
> Subject: Re: How to contribute a spatial module to MADlib manipulating
> objects from PostGIS
> From: [email protected]
> To: [email protected]
>
> Thanks ChenLiang Wang for your interest.
>
> I would repeat Ivan's welcome to you, and I look forward to your
> contributions in the area of GIS.
>
> To answer your questions:
>
> 1. Yes, it is possible to call PostGIS functions from MADlib.
>
> 2. Yes, spatial statistics are suitable for MADlib.
>
> For documentation, please refer to the Apache MADlib wiki
> http://madlib.incubator.apache.org/
>
> which includes:
> Quick Start Guides
>
> Get going with a minimum of fuss.
>
> - Installation Guide
> <https://cwiki.apache.org/confluence/display/MADLIB/Installation+Guide>
> - Quick Start Guide for Users
>
> <https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Users>
> - Quick Start Guide for Developers
>
> <https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers>
>
>
> As Ivan mentioned, writing down the functions you would like to build and
> the interface is a good place to begin. Then we can discuss on the open
> mailing list.
>
> Regards,
> Frank
>
> On Thu, Dec 17, 2015 at 8:11 PM, 王晨 亮 <[email protected]> wrote:
>
> > Thanks for your quick reply. Your suggestion is great. I will give a
> > definitions and description for the spatial statistic functions and
> > comparison with ordinary statistic models.
> >
> >
> > > Date: Thu, 17 Dec 2015 21:56:06 -0500
> > > Subject: Re: How to contribute a spatial module to MADlib manipulating
> > objects from PostGIS
> > > From: [email protected]
> > > To: [email protected]
> > >
> > > Hi ChenLiang,
> > >
> > > I think your proposal is good and worth trying to do it!
> > >
> > > Can I suggest the first steps if you send a proposal of the function
> > > definitions and the parameters and return values as well as description
> > of
> > > the functions and what they do.
> > >
> > > Based on that we can discuss the design of the interface and once it
> > looks
> > > good you can start working on the actual implementation of the coding.
> > > When you get to implementation we can help you on technical challenges.
> > >
> > > Cheers,
> > > Ivan
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Dec 17, 2015 at 9:50 PM, 王晨 亮 <[email protected]> wrote:
> > >
> > > > Hi MADlib Developers,
> > > >
> > > >
> > > >
> > > >
> > > > I am a GIS Researcher and have some knowledge on PostGIS, Python,
> > > > C/C++,Java and R.
> > > >
> > > >
> > > >
> > > > I have learned some spatial statistical models during My PhD research
> > in
> > > > GIS. Recently, I have done a job translating GWR (Geographical Weighted
> > > > Regression) from R into Java for my company. And I would like to
> > > > contribute to MADLib if possible. I believe PostGIS and MADlib are the
> > > > most powerful extensions of PostgreSQL . Therefore, a spatial
> > statistical
> > > > module connecting the two libraries could be significant . If I can
> > start
> > > > the task , the first goal to implement will be GWR model.
> > > >
> > > >
> > > >
> > > > Now I am reading the developer guide of MADlib. I not quite sure how to
> > > > contribute a geospatial module to MADlib. Is it possible to manipulate
> > > > spatial object or attribute from PostGIS in MADlib ?
> > > >
> > > >
> > > >
> > > > So could anyone suggest a few pointers & links that I can follow to get
> > > > to know:
> > > >
> > > >
> > > >
> > > > 1. how to deal with these dependencies about MADlib?
> > > >
> > > >
> > > >
> > > > 2. whether the spatial statistics module is suitable for MADlib?
> > > >
> > > >
> > > >
> > > > Thank you in advance.
> > > >
> > > >
> > > > ChenLiang Wang
> > > >
> > > >
> >
> >