Hi Alex,

Here is the ticket for refining tree predictions. Let's discuss this
further on the JIRA.
https://issues.apache.org/jira/browse/SPARK-4240

There is no ticket yet for quantile regression. It will be great if you
could create one and note down the corresponding loss function and
gradient calculations.
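For reference, here is a rough, untested sketch of what the loss and
gradient might look like for a quantile level tau in (0, 1). This is plain
Scala and deliberately not tied to the Loss trait in the current code;
with tau = 0.5 it reduces to half the absolute error.

object QuantileLossSketch {
  // Pinball (quantile) loss for a single (label, prediction) pair,
  // at quantile level tau in (0, 1).
  def loss(label: Double, prediction: Double, tau: Double): Double = {
    val diff = label - prediction
    if (diff >= 0) tau * diff else (tau - 1.0) * diff
  }

  // Gradient of the loss w.r.t. the prediction. Its negation is the
  // pseudo-residual used to re-label training instances at each
  // boosting iteration.
  def gradient(label: Double, prediction: Double, tau: Double): Double = {
    if (label - prediction > 0) -tau else 1.0 - tau
  }
}

For the tree leaves to be fully consistent with this loss, they would also
need to hold quantiles of the residuals rather than means, which is closely
related to the leaf-refinement ticket above.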
There is a design doc that Joseph Bradley wrote for supporting boosting
algorithms with generic weak learners, but it doesn't include
implementation details. I can definitely help you understand the existing
code if you decide to work on it. However, let's discuss the relevance of
the algorithm to MLlib on the JIRA. It seems like a nice addition, though
I am not sure about the implementation complexity. It will be great to see
what others think.

-Manish

On Tue, Nov 18, 2014 at 10:07 AM, Alessandro Baretta <alexbare...@gmail.com>
wrote:

> Manish,
>
> My use case for (asymmetric) absolute error is quite trivially quantile
> regression. In other words, I want to use Spark to learn conditional
> cumulative distribution functions. See R's GBM quantile regression
> option.
>
> If you either find or create a JIRA ticket, I would be happy to give it
> a shot. Is there a design doc explaining how the gradient boosting
> algorithm is laid out in MLlib? I tried reading the code, but without a
> "Rosetta stone" it's impossible to make sense of it.
>
> Alex
>
> On Mon, Nov 17, 2014 at 8:25 PM, Manish Amde <manish...@gmail.com> wrote:
>
>> Hi Alessandro,
>>
>> I think absolute error as a splitting criterion might be feasible with
>> the current architecture -- I think the sufficient statistics we
>> collect currently might be able to support this. Could you let us know
>> scenarios where absolute error has significantly outperformed squared
>> error for regression trees? Also, what's your use case that makes
>> squared error undesirable?
>>
>> For gradient boosting, you are correct. The weak hypothesis weights
>> refer to tree predictions in each of the branches. We plan to explain
>> this in the 1.2 documentation and maybe add some more clarifications
>> to the Javadoc.
>>
>> I will try to search for JIRAs or create new ones and update this
>> thread.
>>
>> -Manish
>>
>>
>> On Monday, November 17, 2014, Alessandro Baretta <alexbare...@gmail.com>
>> wrote:
>>
>>> Manish,
>>>
>>> Thanks for pointing me to the relevant docs. It is unfortunate that
>>> absolute error is not supported yet. I can't seem to find a JIRA for
>>> it.
>>>
>>> Now, here's what the comments say in the current master branch:
>>>
>>> /**
>>>  * :: Experimental ::
>>>  * A class that implements Stochastic Gradient Boosting
>>>  * for regression and binary classification problems.
>>>  *
>>>  * The implementation is based upon:
>>>  *   J.H. Friedman. "Stochastic Gradient Boosting." 1999.
>>>  *
>>>  * Notes:
>>>  *  - This currently can be run with several loss functions. However,
>>>  *    only SquaredError is fully supported. Specifically, the loss
>>>  *    function should be used to compute the gradient (to re-label
>>>  *    training instances on each iteration) and to weight weak
>>>  *    hypotheses. Currently, gradients are computed correctly for the
>>>  *    available loss functions, but weak hypothesis weights are not
>>>  *    computed correctly for LogLoss or AbsoluteError. Running with
>>>  *    those losses will likely behave reasonably, but lacks the same
>>>  *    guarantees.
>>> ...
>>>  */
>>>
>>> By the looks of it, the GradientBoosting API would support an absolute
>>> error type loss function to perform quantile regression, except for
>>> "weak hypothesis weights". Does this refer to the weights of the
>>> leaves of the trees?
>>>
>>> Alex
>>>
>>> On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde <manish...@gmail.com>
>>> wrote:
>>>
>>>> Hi Alessandro,
>>>>
>>>> MLlib v1.1 supports variance for regression, and Gini impurity and
>>>> entropy for classification.
>>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>>
>>>> If the information gain calculation can be performed by distributed
>>>> aggregation then it might be possible to plug it into the existing
>>>> implementation. We want to perform such calculations (e.g., the
>>>> median) for the gradient boosting models (coming up in the 1.2
>>>> release) using absolute error and deviance as loss functions, but I
>>>> don't think anyone is planning to work on it yet. :-)
>>>>
>>>> -Manish
>>>>
>>>> On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta <
>>>> alexbare...@gmail.com> wrote:
>>>>
>>>>> I see that, as of v1.1, MLlib supports regression and classification
>>>>> tree models. I assume this means that it uses a squared-error loss
>>>>> function for the first and a logistic cost function for the second.
>>>>> I don't see support for quantile regression via an absolute error
>>>>> cost function. Or am I missing something?
>>>>>
>>>>> If, as it seems, this is missing, how do you recommend implementing
>>>>> it?
>>>>>
>>>>> Alex
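One note on the distributed-aggregation point in the quoted thread: an
exact median is not a small, mergeable sufficient statistic, but an
approximate median is easy to compute in one pass with a fixed-bin
histogram and RDD.aggregate. A rough, hypothetical sketch (not MLlib code;
all names are made up for illustration):

import org.apache.spark.rdd.RDD

object ApproxMedianSketch {
  // Approximate the median of an RDD of doubles with a fixed-bin
  // histogram over a known value range, built by a single distributed
  // aggregation.
  def approxMedian(values: RDD[Double], min: Double, max: Double,
                   numBins: Int = 128): Double = {
    require(max > min, "need a non-degenerate value range")
    val width = (max - min) / numBins

    // seqOp: fold one value into a partition-local histogram.
    val seqOp = (hist: Array[Long], v: Double) => {
      val bin = math.min(numBins - 1, math.max(0, ((v - min) / width).toInt))
      hist(bin) += 1L
      hist
    }
    // combOp: merge histograms coming from different partitions.
    val combOp = (a: Array[Long], b: Array[Long]) => {
      var i = 0
      while (i < numBins) { a(i) += b(i); i += 1 }
      a
    }

    val hist = values.aggregate(Array.fill(numBins)(0L))(seqOp, combOp)

    // Walk the merged histogram to the bin containing the middle element
    // and return that bin's midpoint as the approximate median.
    val half = hist.sum / 2.0
    var cum = 0L
    var bin = 0
    while (bin < numBins - 1 && cum + hist(bin) < half) {
      cum += hist(bin)
      bin += 1
    }
    min + (bin + 0.5) * width
  }
}

The same pattern (a fixed-size statistic merged with seqOp/combOp) is how
per-node statistics could be collected for an absolute-error split
criterion, though the accuracy/memory trade-off of the binning would need
discussion on the JIRA.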