GitHub user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/13139#discussion_r63890670
--- Diff: docs/ml-classification-regression.md ---
@@ -374,6 +374,197 @@ regression model and extracting model summary statistics.
</div>
+## Generalized linear regression
+
+When working with data that has a relatively small number of features (< 4096), Spark's GeneralizedLinearRegression interface
+allows for flexible specification of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) which can be used for various types of
+prediction problems including linear regression, Poisson regression, logistic regression, and others.
+
+Contrasted with linear regression where the output is assumed to follow a Gaussian
+distribution, GLMs are specifications of linear models where the response variable $Y_i$ may take on _any_
+distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
+
+$$
+Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right)
+$$
+
+An exponential family distribution is any probability distribution of the form
+
+$$
+f\left(y|\theta, \phi, w\right) = \exp{\left(\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)\right)}
+$$
+
+where the parameter of interest $\theta_i$ is related to the expected value of the response variable
+$\mu_i$ by
+
+$$
+\theta_i = h(\mu_i)
+$$
+
+Here, $h(\mu_i)$ is defined by the form of the exponential family distribution used. GLMs also allow specification
+of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
+and the so-called _linear predictor_ $\eta_i$:
+
+$$
+g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
+$$
+
+Often, the link function is chosen such that $h(\mu) = g(\mu)$, which yields a simplified relationship
+between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
+function $g(\mu)$ is said to be the "canonical" link function.
+
+$$
+\theta_i = h(g^{-1}(\eta_i)) = \eta_i
+$$
+
+A GLM finds the regression coefficients $\vec{\beta}$ which maximize the likelihood function.
+
+$$
+\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
+\prod_{i=1}^{N} \exp{\left(\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)\right)}
+$$
+
+where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$
+by
+
+$$
+\theta_i = h(g^{-1}(\vec{x_i}^T \cdot \vec{\beta}))
+$$
+
+Spark's generalized linear regression interface also provides summary statistics for diagnosing the
+fit of GLM models, including residuals, p-values, deviances, the Akaike information criterion, and
--- End diff --
I would prefer not to. Summary statistics are relatively easy to add, so this
list could change rather frequently. We should let the examples document them.
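
As a side note on the quoted section, here is a minimal sketch, in plain Python rather than Spark's API, of how the Poisson distribution fits the exponential-family form in the diff. It assumes $\theta = \log\mu$, $b(\theta) = e^{\theta}$, $\phi = w = 1$, and $c(y, \phi) = \log y!$ (the minus sign in front of $c$ follows the diff's convention). All function names are illustrative and not part of the PR.

```python
import math


def exp_family_density(y, theta, b, c, phi=1.0, w=1.0):
    """Density in the form exp((y*theta - b(theta)) / (phi/w) - c(y, phi))."""
    return math.exp((y * theta - b(theta)) / (phi / w) - c(y, phi))


def poisson_pmf(y, mu):
    """Textbook Poisson probability mass function, used as a reference."""
    return mu ** y * math.exp(-mu) / math.factorial(y)


def poisson_as_exp_family(y, mu):
    """Poisson written in exponential-family form (assumed parameterization)."""
    theta = math.log(mu)  # theta = h(mu) = log(mu)
    return exp_family_density(
        y,
        theta,
        b=math.exp,                              # b(theta) = exp(theta)
        c=lambda y_, phi: math.lgamma(y_ + 1),   # c(y, phi) = log(y!)
    )


def poisson_log_likelihood(beta, xs, ys):
    """Log-likelihood under the canonical log link, where theta_i = eta_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor x_i^T beta
        theta = eta                                    # canonical link: theta = eta
        total += y * theta - math.exp(theta) - math.lgamma(y + 1)
    return total


# Toy data (made up for illustration): an intercept column plus one feature.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
ys = [1, 3, 7]
ll_flat = poisson_log_likelihood([0.0, 0.0], xs, ys)
ll_slope = poisson_log_likelihood([0.0, 1.0], xs, ys)
```

On this toy data, the coefficients with a positive slope score a higher log-likelihood than the all-zero coefficients, which is the direction a maximum-likelihood fit would move in.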