Please find my comments inline. -Xiangrui

On Wed, May 28, 2014 at 11:18 AM, Bharath Ravi Kumar <reachb...@gmail.com> wrote:
> I'm looking to reuse the LogisticRegression model (with SGD) to predict a
> real-valued outcome variable. (I understand that logistic regression is
> generally applied to predict a binary outcome, but for various reasons, this
> model suits our needs better than LinearRegression.) Related to that, I have
> the following questions:
>
> 1) Can the current LogisticRegression model be used as is to train on
> binary input (i.e. explanatory) features, or is there an assumption that
> the explanatory features must be continuous?
>
Binary features should be okay.

> 2) I intend to reuse the current class to train a model on LabeledPoints
> where the label is a real value (and not 0 / 1). I'd like to know if
> invoking setValidateData(false) would suffice, or if one must override the
> validator to achieve this.

I'm not sure whether the loss function makes sense with real-valued labels.
We may use the assumption that the label is binary to simplify the
computation of the loss. You can take a look at the code and see whether
the loss function fits your model.

> 3) I recall seeing an experimental method on the class
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala)
> that clears the threshold separating positive & negative predictions. Once
> the model is trained on real-valued labels, would clearing this flag
> suffice to predict an outcome that is continuous in nature?

If you clear the threshold, the model outputs the raw scores from the
logistic function.

> Thanks,
> Bharath
>
> P.S.: I'm writing to dev@ and not user@ assuming that lib changes might be
> necessary. Apologies if the mailing list is incorrect.
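To make the second point concrete: the cross-entropy (logistic) loss is mathematically well defined for any label in [0, 1], but an implementation that branches on a binary label (a common simplification of the kind Xiangrui describes) computes something different once labels leave {0, 1}. A standalone Python sketch, not MLlib code; the function names are mine:

```python
import math

def full_cross_entropy(y, margin):
    """Cross-entropy loss -y*log(p) - (1-y)*log(1-p), well defined for
    any label y in [0, 1], where p = sigmoid(margin)."""
    return (y * math.log1p(math.exp(-margin))
            + (1.0 - y) * math.log1p(math.exp(margin)))

def binary_shortcut_loss(y, margin):
    """The same loss computed with a binary-label shortcut: branch on
    whether y is the positive class. Equivalent only when y is 0 or 1."""
    if y > 0:
        return math.log1p(math.exp(-margin))  # -log(p), the y = 1 case
    else:
        return math.log1p(math.exp(margin))   # -log(1 - p), the y = 0 case

# Identical for binary labels:
print(full_cross_entropy(1.0, 2.0), binary_shortcut_loss(1.0, 2.0))
# But they diverge for a real-valued label such as 0.7:
print(full_cross_entropy(0.7, 2.0), binary_shortcut_loss(0.7, 2.0))
```

So even with validation disabled, a loss computation that exploits the binary-label assumption would silently optimize the wrong objective for fractional labels — which is why checking the actual loss code is the right advice.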
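On the third point, the effect of clearing the threshold can be sketched outside of Spark like this (a hypothetical `predict` helper, not MLlib's actual implementation):

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, intercept, features, threshold=0.5):
    """Score a point the way a thresholded logistic model would:
    return a 0/1 class label when a threshold is set, or the raw
    logistic score when the threshold is cleared (None)."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    score = sigmoid(margin)
    if threshold is None:   # threshold cleared
        return score        # raw score, always in (0, 1)
    return 1.0 if score > threshold else 0.0

w, b = [0.4, -1.2], 0.1
x = [2.0, 0.5]
print(predict(w, b, x))                  # thresholded: 0.0 or 1.0
print(predict(w, b, x, threshold=None))  # raw logistic score
```

Note that the raw score is still squashed through the logistic function into (0, 1), so clearing the threshold alone won't reproduce an unbounded real-valued target.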