Hi ChenLiang! This is a great starting point.
I am taking some time to review it and am also getting help from co-workers who are geospatial experts. In a couple more days I will have more concrete feedback on this. Talk to you all again soon.

Cheers,
Ivan

On Tue, Jan 5, 2016 at 4:06 PM, WangChenLiang <[email protected]> wrote:

> Hi MADlib Developers,
>
> Following Ivan and Frank's suggestion, I would like to propose a
> description and interface for geographically weighted regression (GWR).
> PostGIS functions will be invoked to compute distances in a given CRS and
> to extract the rectangle coordinates of the study area. If MADlib does not
> have access to PostGIS routines, we can only implement some simple GIS
> utilities in our own code.
>
> GWR models the local relationship between a numerical dependent variable
> and one or more explanatory independent variables, building a model of
> spatially varying relationships. It has been widely used for understanding
> the spatial patterns of natural and social phenomena.
>
> GWR constructs a separate local equation for each location in the table,
> incorporating the dependent and independent variables that fall within the
> bandwidth of each target geometry. The shape and extent of the bandwidth
> depend on the spatial kernel type (gauss, exp or bisquare) and on the
> distance (in fixed methods) or the number of neighbors (in adaptive
> methods). Therefore, the computational burden of GWR increases with the
> number of prediction locations. A parallelized GWR is necessary in a
> high-performance environment such as GPDB.
>
> There are two important caveats about GWR. First, GWR can estimate
> coefficients at any location but can provide diagnostic information only
> at observation locations. Second, according to Páez et al. (2011), basic
> GWR is not an appropriate method for small sample sizes (<160). Several
> more advanced geographically weighted methods have been proposed (see
> Wheeler DC 2009, Brunsdon C et al. 2012, Gollini I et al. 2015), which I
> plan to implement in the future.
> The description of the interface and functions for GWR is also provided.
> The coefficient columns in the output are kept separate so that the
> results are easy to map in GIS. Could you kindly take a look and give me
> advice or feedback to improve it? Many thanks!
>
> Best,
> ChenLiang Wang
>
> ----------------------------------------------------------------------
> Description of Geographically Weighted Regression (Spatial
> Statistics -> Regression Models)
>
> Training Function
> The geographically weighted regression training function has the
> following syntax.
>
> gwregr_train(source_table,
>              out_table,
>              dependent_varname,
>              independent_varname,
>              kernel_params,
>              adaptive_option,
>              ftest_option,
>              regression_location,
>              prediction_location,
>              grouping_cols,
>              verbose
>             )
>
> ----------------------------------------------------------------------
> Arguments
>
> source_table
>     TEXT. The name of the table containing the training data.
>
> out_table
>     TEXT. Name of the generated table containing the output model.
>
>     The output table contains the following columns.
>
>     <...> Any grouping columns provided during training. Present only
>         if the grouping option is used.
>     coef_<independent_varname1>, coef_<independent_varname2> ...
>         FLOAT8[]. One column per independent variable, holding the vector
>         of regression coefficients for that variable at each location.
>     r2 FLOAT8. R-squared coefficient of determination of the model.
>     adjr2 FLOAT8. Adjusted R-squared coefficient of determination of
>         the model.
>     local_cond_no FLOAT8[]. The local condition number of GWR at each
>         location (see Wheeler D 2007). Values above 30 indicate results
>         that are unstable due to local multicollinearity.
>     F1_stats FLOAT8[]. The F-test array {F-statistic, numerator DF,
>         denominator DF, p_value} for comparing the Ordinary Linear
>         Regression (OLR) and GWR models (see Leung et al. 2000).
>     F2_stats FLOAT8[].
>         The F-test array {F-statistic, numerator DF, denominator DF,
>         p_value} for comparing the Ordinary Linear Regression (OLR) and
>         GWR models (see Leung et al. 2000).
>     F3_stats FLOAT8[]. The spatial stationarity test statistic for the
>         GWR coefficients (see Leung et al. 2000).
>     F3_ndf FLOAT8[]. The spatial stationarity test numerator DF for the
>         GWR coefficients (see Leung et al. 2000).
>     F3_ddf FLOAT8[]. The spatial stationarity test denominator DF for the
>         GWR coefficients (see Leung et al. 2000).
>     F3_pv FLOAT8[]. The spatial stationarity test p_value for the GWR
>         coefficients (see Leung et al. 2000).
>     F4_stats FLOAT8[]. The F-test array {F-statistic, numerator DF,
>         denominator DF, p_value} for comparing the Ordinary Linear
>         Regression (OLR) and GWR models (see GWR book, p. 92).
>     num_missing_rows_skipped INTEGER. The number of rows that have
>         NULL values in the dependent or independent variables and were
>         skipped in the computation for each group.
>
>     A summary table named <out_table>_summary is created together with
>     the output table. It has the following columns:
>
>     source_table The data source table name.
>     out_table The output table name.
>     dependent_varname The dependent variable.
>     independent_varname The independent variables.
>     num_rows_processed The total number of rows that were used in the
>         computation.
>     num_missing_rows_skipped The total number of rows that were
>         skipped because of NULL values in them.
>     kernel_function The spatial kernel function.
>     bandwidth The bandwidth parameter.
>     adaptive_option Boolean indicating whether an adaptive kernel
>         function is used.
>
> dependent_varname
>     TEXT. Expression to evaluate for the dependent variable.
>
> independent_varname
>     TEXT. Expression list to evaluate for the independent variables. An
>     intercept variable is not assumed. It is common to provide an explicit
>     intercept term by including a single constant 1 term in the
>     independent variable list.
>
> kernel_params (optional)
>     TEXT, default: 'kernel=gauss,bw=CV'. Parameters for the kernel
>     function. The kernel parameter is the name of the kernel function to
>     use:
>
>     ‘gauss’: wgt = exp(-.5*(vdist/bw)^2);
>     ‘exp’: wgt = exp(-vdist/bw);
>     ‘bisquare’: wgt = (1-(vdist/bw)^2)^2 if vdist < bw, wgt = 0 otherwise;
>
>     where wgt is the weight, vdist is the vector of distances, and bw is
>     the bandwidth.
>
>     Set bw to either CV or AICc when you are not sure what to use for the
>     distance or number-of-neighbors parameter. A numerical value can also
>     be specified for bw. If bw is large enough (above 1e7, for example),
>     the GWR coefficient estimates are effectively equal to the global
>     estimates from ordinary linear regression.
>
> adaptive_option (optional)
>     BOOLEAN, default: FALSE. When TRUE, an adaptive kernel is calculated
>     in which the bandwidth corresponds to the number of nearest neighbors
>     (i.e. adaptive distance).
>
> ftest_option (optional)
>     BOOLEAN, default: FALSE. When TRUE, three F-tests and the spatial
>     stationarity test of the coefficients are also conducted and returned
>     with the results, following Leung et al. (2000).
>
> regression_location
>     2D Point or Polygon Geometry. A geometry (usually a 2D point geometry)
>     representing the locations where training should be conducted. The
>     number of regression_location geometries must equal the number of rows
>     in source_table. In most cases, it is a geometry field of source_table.
>
> prediction_location (optional)
>     2D Point or Polygon Geometry, default: regression_location. A geometry
>     (usually a 2D point geometry) representing the locations where the
>     coefficient estimates should be computed.
>
> grouping_cols (optional)
>     TEXT, default: NULL. An expression list used to group the input
>     dataset into discrete groups, running one regression per group. Similar
>     to the SQL GROUP BY clause. When this value is NULL, no grouping is
>     used and a single result model is generated.
>
> verbose (optional)
>     BOOLEAN, default: FALSE.
>     Provides verbose output of the results of training.
>
> ----------------------------------------------------------------------
> Prediction Function
>
> gwregr_predict(coef, col_ind, newdata_table)
>
> Arguments
>
> coef
>     FLOAT8[][]. Vector of the coefficients of regression.
>
> col_ind
>     FLOAT8[]. An array containing the independent variable column names.
>
> newdata_table (optional)
>     TEXT, default: NULL. The name of the table that provides new data at
>     the prediction locations. If prediction_location is the same as
>     regression_location (the default) in the training function, this
>     parameter can be omitted. Otherwise, newdata_table is required and
>     must provide the independent variables at the prediction locations,
>     with field names identical to those in source_table.
>
> > Date: Fri, 18 Dec 2015 09:18:22 -0800
> > Subject: Re: How to contribute a spatial module to MADlib manipulating
> > objects from PostGIS
> > From: [email protected]
> > To: [email protected]
> >
> > Thanks ChenLiang Wang for your interest.
> >
> > I would repeat Ivan's welcome to you, and I look forward to your
> > contributions in the area of GIS.
> >
> > To answer your questions:
> >
> > 1. Yes, it is possible to call PostGIS functions from MADlib.
> >
> > 2. Yes, spatial statistics are suitable for MADlib.
> >
> > For documentation, please refer to the Apache MADlib wiki
> > http://madlib.incubator.apache.org/
> >
> > which includes:
> > Quick Start Guides
> >
> > Get going with a minimum of fuss.
> >
> > - Installation Guide
> >   <https://cwiki.apache.org/confluence/display/MADLIB/Installation+Guide>
> > - Quick Start Guide for Users
> >   <https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Users>
> > - Quick Start Guide for Developers
> >   <https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers>
> >
> > As Ivan mentioned, writing down the functions you would like to build
> > and the interface is a good place to begin. Then we can discuss on the
> > open mailing list.
> >
> > Regards,
> > Frank
> >
> > On Thu, Dec 17, 2015 at 8:11 PM, 王晨 亮 <[email protected]> wrote:
> >
> > > Thanks for your quick reply. Your suggestion is great. I will write
> > > up definitions and descriptions of the spatial statistics functions,
> > > along with a comparison with ordinary statistical models.
> > >
> > > > Date: Thu, 17 Dec 2015 21:56:06 -0500
> > > > Subject: Re: How to contribute a spatial module to MADlib
> > > > manipulating objects from PostGIS
> > > > From: [email protected]
> > > > To: [email protected]
> > > >
> > > > Hi ChenLiang,
> > > >
> > > > I think your proposal is good and worth trying!
> > > >
> > > > Can I suggest, as a first step, that you send a proposal with the
> > > > function definitions, parameters and return values, as well as a
> > > > description of the functions and what they do?
> > > >
> > > > Based on that we can discuss the design of the interface, and once
> > > > it looks good you can start working on the actual implementation.
> > > > When you get to implementation we can help you with technical
> > > > challenges.
> > > >
> > > > Cheers,
> > > > Ivan
> > > >
> > > > On Thu, Dec 17, 2015 at 9:50 PM, 王晨 亮 <[email protected]> wrote:
> > > >
> > > > > Hi MADlib Developers,
> > > > >
> > > > > I am a GIS researcher and have some knowledge of PostGIS, Python,
> > > > > C/C++, Java and R.
> > > > >
> > > > > I learned some spatial statistical models during my PhD research
> > > > > in GIS. Recently, I translated GWR (Geographically Weighted
> > > > > Regression) from R into Java for my company, and I would like to
> > > > > contribute to MADlib if possible. I believe PostGIS and MADlib
> > > > > are the most powerful extensions of PostgreSQL. Therefore, a
> > > > > spatial statistics module connecting the two libraries could be
> > > > > significant. If I can start the task, the first goal will be to
> > > > > implement the GWR model.
> > > > >
> > > > > I am now reading the MADlib developer guide. I am not quite sure
> > > > > how to contribute a geospatial module to MADlib. Is it possible
> > > > > to manipulate spatial objects or attributes from PostGIS in
> > > > > MADlib?
> > > > >
> > > > > So could anyone suggest a few pointers and links that I can
> > > > > follow to find out:
> > > > >
> > > > > 1. how to deal with the dependencies of MADlib?
> > > > >
> > > > > 2. whether a spatial statistics module is suitable for MADlib?
> > > > >
> > > > > Thank you in advance.
> > > > >
> > > > > ChenLiang Wang
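
To make the kernel formulas in the proposal concrete, here is a minimal Python sketch of the three weight functions (gauss, exp, bisquare) and of a single local fit at one target location. It is only an illustration under stated assumptions, not MADlib code: the names `kernel_weights` and `local_fit` are hypothetical, distances are plain planar Euclidean distances rather than PostGIS distances in a CRS, the bandwidth is fixed (not adaptive), and the model is restricted to an intercept plus one independent variable so that the weighted least squares solution has a closed form.

```python
import math


def kernel_weights(dists, bw, kernel="gauss"):
    """Return one weight per distance, using the formulas from the proposal."""
    if bw <= 0:
        raise ValueError("bw must be positive")
    if kernel == "gauss":
        return [math.exp(-0.5 * (d / bw) ** 2) for d in dists]
    if kernel == "exp":
        return [math.exp(-d / bw) for d in dists]
    if kernel == "bisquare":
        # Observations at or beyond the bandwidth get zero weight.
        return [(1 - (d / bw) ** 2) ** 2 if d < bw else 0.0 for d in dists]
    raise ValueError(f"unknown kernel: {kernel}")


def local_fit(points, x, y, target, bw, kernel="gauss"):
    """Weighted least squares of y on (1, x) centred at `target`.

    Assumption: planar Euclidean distance between (px, py) pairs.
    Returns (intercept, slope) for this one location.
    """
    dists = [math.hypot(px - target[0], py - target[1]) for px, py in points]
    w = kernel_weights(dists, bw, kernel)
    sw = sum(w)
    if sw == 0:
        raise ValueError("no observations fall within the bandwidth")
    # Weighted means of the predictor and the response.
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted variance of x and weighted covariance of x and y.
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    if sxx == 0:
        raise ValueError("predictor has no weighted variation at this location")
    slope = sxy / sxx
    return my - slope * mx, slope
```

Calling `local_fit` once per prediction location gives one coefficient pair per location, which is why the cost grows with the number of locations. With a very large `bw` every gauss weight approaches 1, so the local fit approaches the ordinary global regression.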
