FYI: https://issues.apache.org/jira/browse/SPARK-14464
I have submitted a PR as well.

On Fri, Mar 18, 2016 at 7:15 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> No, I didn't yet - feel free to create a JIRA.
>
> On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann <daniel.siegm...@teamaol.com> wrote:
>> Hi Nick,
>>
>> Thanks again for your help with this. Did you create a ticket in JIRA for investigating sparse models in LR and / or multivariate summariser? If so, can you give me the issue key(s)? If not, would you like me to create these tickets?
>>
>> I'm going to look into this some more and see if I can figure out how to implement these fixes.
>>
>> ~Daniel Siegmann
>>
>> On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>> Also adding dev list in case anyone else has ideas / views.
>>>
>>> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>> Thanks for the feedback.
>>>>
>>>> I think Spark can certainly meet your use case when your data size scales up, as the actual model dimension is very small - you will need to use those indexers or some other mapping mechanism.
>>>>
>>>> There is ongoing work for Spark 2.0 to make it easier to use models outside of Spark - also see PMML export (I think mllib logistic regression is supported, but I have to check that). That will help with using Spark models in serving environments.
>>>>
>>>> Finally, I will add a JIRA to investigate sparse models for LR - maybe also a ticket for the multivariate summariser (though I don't think in practice there will be much to gain).
>>>>
>>>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <daniel.siegm...@teamaol.com> wrote:
>>>>> Thanks for the pointer to those indexers, those are some good examples. A good way to go for the trainer and any scoring done in Spark. I will definitely have to deal with scoring in non-Spark systems, though.
>>>>>
>>>>> I think I will need to scale up beyond what single-node liblinear can practically provide. The system will need to handle much larger sub-samples of this data (and other projects might be larger still). Additionally, the system needs to train many models in parallel (hyper-parameter optimization with n-fold cross-validation, multiple algorithms, different sets of features).
>>>>>
>>>>> Still, I suppose we'll have to consider whether Spark is the best system for this. For now, though, my job is to see what can be achieved with Spark.
>>>>>
>>>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>>>> Ok, I think I understand things better now.
>>>>>>
>>>>>> For Spark's current implementation, you would need to map those features as you mention. You could also use, say, StringIndexer -> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with the mapping and training (e.g. http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline). Pipeline supports persistence.
>>>>>>
>>>>>> But it depends on your scoring use case too - a Spark pipeline can be saved and then reloaded, but you need all of Spark's dependencies in your serving app, which is often not ideal. If you're doing bulk scoring offline, then it may suit.
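A rough sketch of the indexer-plus-Pipeline setup described above, assuming the Spark 2.x-era spark.ml API; the column names, toy data, and save path are invented for illustration, and persistence details vary by Spark version:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()

    // Toy training data; "thingId" and "label" are invented column names.
    val training = spark.createDataFrame(Seq(
      (1.0, "17"), (0.0, "23"), (1.0, "17"), (0.0, "42")
    )).toDF("label", "thingId")

    // StringIndexer assigns each distinct id a compact index;
    // OneHotEncoder turns that index into a sparse indicator vector.
    val indexer = new StringIndexer().setInputCol("thingId").setOutputCol("thingIndex")
    val encoder = new OneHotEncoder().setInputCol("thingIndex").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.01)

    // The Pipeline bundles the id-to-index mapping and the training into one estimator.
    val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
    val model = pipeline.fit(training)

    // The fitted pipeline (including the indexer's mapping) can be saved and reloaded for scoring.
    model.write.overwrite().save("/tmp/thing-lr-pipeline")

    spark.stop()
  }
}
```

One practical point: the fitted PipelineModel carries the StringIndexer's id-to-index mapping along with the coefficients, which is one way of keeping the mapping encapsulated with the model for later scoring.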
>>>>>> Honestly though, for that data size I'd certainly go with something like Liblinear :) Spark will ultimately scale better with # training examples for very large scale problems. However, there are definitely limitations on model dimension and sparse weight vectors currently. There are potential solutions to these, but they haven't been implemented as yet.
>>>>>>
>>>>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <daniel.siegm...@teamaol.com> wrote:
>>>>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>>>>>> Would you mind letting us know the # training examples in the datasets? Also, what do your features look like? Are they text, categorical etc? You mention that most rows only have a few features, and all rows together have a few 10,000s of features, yet your max feature value is 20 million. How are you constructing your feature vectors to get a 20 million size? The only realistic way I can see this situation occurring in practice is with feature hashing (HashingTF).
>>>>>>>
>>>>>>> The sub-sample I'm currently training on is about 50K rows, so ... small.
>>>>>>>
>>>>>>> The features causing this issue are numeric (int) IDs for ... let's call it "Thing". For each Thing in the record, we set the feature Thing.id to a value of 1.0 in our vector (which is of course a SparseVector). I'm not sure how IDs are generated for Things, but they can be large numbers.
>>>>>>>
>>>>>>> The largest Thing ID is around 20 million, so that ends up being the size of the vector. But in fact there are fewer than 10,000 unique Thing IDs in this data. The mean number of features per record in what I'm currently training against is 41, while the maximum for any given record was 1754.
>>>>>>>
>>>>>>> It is possible to map the features into a small set (just need to zipWithIndex), but this is undesirable because of the added complexity (not just for the training, but also for anything wanting to score against the model). It might be a little easier if this could be encapsulated within the model object itself (perhaps via composition), though I'm not sure how feasible that is.
>>>>>>>
>>>>>>> But I'd rather not bother with dimensionality reduction at all - since we can train using liblinear in just a few minutes, it doesn't seem necessary.
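A minimal sketch of the zipWithIndex remapping mentioned above, assuming the records arrive as (label, array of Thing IDs) pairs; the RDD shape, the names, and the org.apache.spark.ml.linalg import (Spark 2.x) are assumptions, not code from this thread:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// records: (label, Thing IDs present in that record); this shape is an assumption.
def buildVectors(records: RDD[(Double, Array[Int])]): RDD[(Double, Vector)] = {
  // Assign each distinct Thing ID a compact index (0 until ~10K),
  // instead of using the raw ID (up to ~20 million) as the vector index.
  val idToIndex: Map[Int, Int] = records
    .flatMap { case (_, ids) => ids }
    .distinct()
    .zipWithIndex()
    .mapValues(_.toInt)
    .collectAsMap()
    .toMap

  val numFeatures = idToIndex.size
  val bcIdToIndex = records.sparkContext.broadcast(idToIndex)

  // Each record becomes a SparseVector sized by the number of distinct IDs, not the max ID.
  records.map { case (label, ids) =>
    val indices = ids.map(bcIdToIndex.value).distinct.sorted
    (label, Vectors.sparse(numFeatures, indices, Array.fill(indices.length)(1.0)))
  }
}
```

The resulting vectors have size equal to the number of distinct IDs (under 10,000 here) rather than the largest ID (around 20 million), at the cost of having to ship the id-to-index map to anything that scores against the model.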
>>>>>>>> MultivariateOnlineSummarizer uses dense arrays, but it should be possible to enable sparse data. Though in theory, the result will tend to be dense anyway, unless you have very many entries in the input feature vector that never occur and are actually zero throughout the data set (which it seems is the case with your data?). So I doubt whether using sparse vectors for the summarizer would improve performance in general.
>>>>>>>
>>>>>>> Yes, that is exactly my case - the vast majority of entries in the input feature vector will *never* occur. Presumably that means most of the values in the aggregators' arrays will be zero.
>>>>>>>
>>>>>>>> LR doesn't accept a sparse weight vector, as it uses dense vectors for coefficients and gradients currently. When using L1 regularization, it could support sparse weight vectors, but the current implementation doesn't do that yet.
>>>>>>>
>>>>>>> Good to know it is theoretically possible to implement. I'll have to give it some thought. In the meantime I guess I'll experiment with coalescing the data to minimize the communication overhead.
>>>>>>>
>>>>>>> Thanks again.
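The coalescing mentioned at the end might look roughly like the sketch below; `trainingData` and the partition count are placeholders, and whether it actually helps depends on the cluster and data layout:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

// trainingData is assumed to be an existing DataFrame with "label" and "features" columns.
def fitWithFewerPartitions(trainingData: DataFrame): LogisticRegressionModel = {
  // Fewer, larger partitions mean fewer partial aggregates to combine per iteration,
  // which is the communication overhead being minimized; 16 is an arbitrary count to tune.
  val coalesced = trainingData.coalesce(16).cache()
  new LogisticRegression().setMaxIter(100).fit(coalesced)
}
```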