Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13139#discussion_r64790228
  
    --- Diff: docs/ml-classification-regression.md ---
    @@ -374,6 +374,137 @@ regression model and extracting model summary statistics.
     
     </div>
     
    +## Generalized linear regression
    +
    +Contrasted with linear regression where the output is assumed to follow a 
Gaussian
    +distribution, [generalized linear 
models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are 
specifications of linear models where the response variable $Y_i$ may take on 
_any_
    +distribution from the [exponential family of 
distributions](https://en.wikipedia.org/wiki/Exponential_family).
    +Spark's `GeneralizedLinearRegression` interface
    +allows for flexible specification of GLMs which can be used for various 
types of
    +prediction problems including linear regression, Poisson regression, 
logistic regression, and others.
    +Currently in `spark.ml`, only a subset of the exponential family 
distributions are supported and they are listed
    +[below](#available-families).
    +
    +**NOTE**: Spark currently only supports up to 4096 features through its 
`GeneralizedLinearRegression`
    +interface, and will throw an exception if this constraint is exceeded. See 
the [advanced section](ml-advanced) for more details.
    + Still, for linear and logistic regression, models with an increased 
number of features can be trained 
    + using the `LinearRegression` and `LogisticRegression` estimators.
    +
    +The canonical form of an exponential family distribution is given as:
    +
    +$$
    +f_Y(y|\theta, \tau) = h(y, \tau)\exp{\left( \frac{\theta \cdot T(y) - A(\theta)}{d(\tau)} \right)}
    +$$
    +
    +where $\theta$ is the parameter of interest and $\tau$ is a dispersion 
parameter. In a GLM the response variable $Y_i$ is assumed to be drawn from an 
exponential family distribution:
    +
    +$$
    +Y_i \sim f\left(\cdot|\theta_i, \tau \right)
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the expected 
value of the response variable $\mu_i$ by
    +
    +$$
    +\mu_i = A'(\theta_i)
    +$$
    +
    +Here, $A'(\theta_i)$ is defined by the form of the exponential family 
distribution used. GLMs also allow specification
    +of a link function, which defines the relationship between the expected 
value of the response variable $\mu_i$
    +and the so called _linear predictor_ $\eta_i$:
    +
    +$$
    +g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
    +$$
    +
    +Often, the link function is chosen such that $A' = g^{-1}$, which yields a 
simplified relationship
    +between the parameter of interest $\theta$ and the linear predictor 
$\eta$. In this case, the link
    +function $g(\mu)$ is said to be the "canonical" link function.
    +
    +$$
    +\theta_i = A'^{-1}(\mu_i) = g(g^{-1}(\eta_i)) = \eta_i
    +$$
    +
    +A GLM finds the regression coefficients $\vec{\beta}$ which maximize the 
likelihood function.
    +
    +$$
    +\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) =
    +\prod_{i=1}^{N} h(y_i, \tau) \exp{\left(\frac{y_i\theta_i - A(\theta_i)}{d(\tau)}\right)}
    +$$
    +
    +where the parameter of interest $\theta_i$ is related to the regression 
coefficients $\vec{\beta}$
    +by
    +
    +$$
    +\theta_i = A'(g^{-1}(\vec{x_i} \cdot \vec{\beta}))
    --- End diff --
    
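    Should A' be inverted here? The text above defines $\mu_i = A'(\theta_i)$ and $g(\mu_i) = \eta_i$, so I would expect $\theta_i = A'^{-1}(g^{-1}(\vec{x_i} \cdot \vec{\beta}))$, which also matches the canonical-link identity $\theta_i = A'^{-1}(\mu_i)$ earlier in the section.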
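
A quick numerical check of the question, offered only as a sketch. It assumes the Poisson family in canonical form, where $A(\theta) = e^{\theta}$, together with the log link. Neither choice comes from the diff; both are picked here for illustration, and the function names are hypothetical. With a canonical link the section says $\theta_i = \eta_i$, so whichever form of the last equation reproduces $\eta$ is the consistent one.

```python
import math

# Assumed for illustration: Poisson family, A(theta) = exp(theta),
# hence A'(theta) = exp(theta) and A'^{-1}(mu) = log(mu).
def a_prime(theta):
    return math.exp(theta)

def a_prime_inv(mu):
    return math.log(mu)

# Canonical (log) link for Poisson: g(mu) = log(mu), g^{-1}(eta) = exp(eta).
def g_inv(eta):
    return math.exp(eta)

eta = 0.7  # arbitrary value of the linear predictor x_i . beta

# Formula as written in the diff: theta = A'(g^{-1}(eta))
theta_as_written = a_prime(g_inv(eta))

# Formula with A' inverted: theta = A'^{-1}(g^{-1}(eta))
theta_inverted = a_prime_inv(g_inv(eta))

# Mean implied by the inverted form, via mu = A'(theta)
mu_from_theta = a_prime(theta_inverted)

print("eta              =", eta)
print("theta as written =", theta_as_written)
print("theta inverted   =", theta_inverted)
```

Under these assumptions the inverted form returns $\eta$ up to floating-point rounding and recovers $\mu = g^{-1}(\eta)$, while the form as written does not.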

