Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64803173

    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,137 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a Gaussian
    +distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ may take on _any_
    +distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various types of
    +prediction problems including linear regression, Poisson regression, logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features through its `GeneralizedLinearRegression`
    +interface, and will throw an exception if this constraint is exceeded. See the [advanced section](ml-advanced) for more details.
    + Still, for linear and logistic regression, models with an increased number of features can be trained
    + using the `LinearRegression` and `LogisticRegression` estimators.
    +
    +The canonical form of an exponential family distribution is given as:
    +
    +$$
    +f_Y(y|\theta, \tau) = h(y, \tau)\exp{\left( \frac{\theta \cdot T(y) - A(\theta)}{d(\tau)} \right)}
    --- End diff --

So, it seems as though every source on the internet, academic and otherwise, explains GLMs/exponential families differently with different notation and different terminology.
My understanding is that GLMs usually work with an exponential family in its "natural" form, which is a reparameterized version of an even more generic specification of exponential families. Nearly _every_ resource I find **besides** Wikipedia assumes this "natural" form without saying so. So the `T(y)` appears sometimes, but mostly not. I think the updated explanation is correct, but please let me know if you think it could be clearer.
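As a rough illustration of the "natural" form (a sketch only, not part of the PR), the Poisson distribution fits the formula in the diff with `θ = log(μ)`, `T(y) = y`, `A(θ) = exp(θ)`, `d(τ) = 1`, and `h(y, τ) = 1/y!`. The function names below are hypothetical and chosen just for this check:

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """Textbook Poisson pmf: mu^y * exp(-mu) / y!."""
    return mu ** y * math.exp(-mu) / math.factorial(y)


def natural_form_pmf(y: int, theta: float) -> float:
    """Poisson pmf written as h(y) * exp(theta * T(y) - A(theta)).

    Assumed mapping (for illustration): T(y) = y, A(theta) = exp(theta),
    h(y) = 1 / y!, and dispersion d(tau) = 1.
    """
    h = 1.0 / math.factorial(y)
    return h * math.exp(theta * y - math.exp(theta))


if __name__ == "__main__":
    mu = 2.5
    theta = math.log(mu)  # natural parameter for the Poisson family
    for y in range(6):
        print(y, poisson_pmf(y, mu), natural_form_pmf(y, theta))
```

Since `T(y) = y` here, the `T(y)` term can be dropped from the notation for this family, which is presumably why many references omit it.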