Hi Nick,

Thanks again for your help with this. Did you create a ticket in JIRA for
investigating sparse models in LR and/or the multivariate summariser? If so,
can you give me the issue key(s)? If not, would you like me to create these
tickets?

I'm going to look into this some more and see if I can figure out how to
implement these fixes.

~Daniel Siegmann

On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Also adding dev list in case anyone else has ideas / views.
>
> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Thanks for the feedback.
>>
>> I think Spark can certainly meet your use case when your data size scales
>> up, since the actual model dimension is very small - you will just need to
>> use those indexers or some other mapping mechanism.
>>
>> There is ongoing work for Spark 2.0 to make it easier to use models
>> outside of Spark - also see PMML export (I think mllib logistic regression
>> is supported but I have to check that). That will help with using Spark
>> models in serving environments.
>>
>> Finally, I will add a JIRA to investigate sparse models for LR - maybe
>> also a ticket for multivariate summariser (though I don't think in practice
>> there will be much to gain).
>>
>>
>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <
>> daniel.siegm...@teamaol.com> wrote:
>>
>>> Thanks for the pointer to those indexers; those are some good examples.
>>> That seems like a good way to go for the trainer and any scoring done in
>>> Spark. I will definitely have to deal with scoring in non-Spark systems,
>>> though.
>>>
>>> I think I will need to scale up beyond what single-node liblinear can
>>> practically provide. The system will need to handle much larger sub-samples
>>> of this data (and other projects might be larger still). Additionally, the
>>> system needs to train many models in parallel (hyper-parameter optimization
>>> with n-fold cross-validation, multiple algorithms, different sets of
>>> features).
>>>
>>> Still, I suppose we'll have to consider whether Spark is the best system
>>> for this. For now though, my job is to see what can be achieved with Spark.
>>>
>>>
>>>
>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
>>>> Ok, I think I understand things better now.
>>>>
>>>> For Spark's current implementation, you would need to map those
>>>> features as you mention. You could also use, say, StringIndexer ->
>>>> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with
>>>> the mapping and training (e.g.
>>>> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
>>>> Pipeline supports persistence.
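>>>>
>>>> A rough, untested sketch of what such a pipeline could look like,
>>>> assuming for simplicity a single categorical column "thingId" and a
>>>> label column "label" in a DataFrame trainingDF (all placeholder names):
>>>>
>>>>   import org.apache.spark.ml.Pipeline
>>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>>   import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
>>>>
>>>>   // map raw IDs to a compact index, one-hot encode into a small
>>>>   // feature vector, then train logistic regression on that
>>>>   val indexer = new StringIndexer()
>>>>     .setInputCol("thingId").setOutputCol("thingIndex")
>>>>   val encoder = new OneHotEncoder()
>>>>     .setInputCol("thingIndex").setOutputCol("features")
>>>>   val lr = new LogisticRegression()
>>>>     .setLabelCol("label").setFeaturesCol("features")
>>>>
>>>>   val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
>>>>   val model = pipeline.fit(trainingDF)
>>>>
>>>>   // the fitted indexer/encoder stages are saved along with the model,
>>>>   // so scoring code reloads the same mapping
>>>>   model.write.overwrite().save("/path/to/model")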
>>>>
>>>> But it depends on your scoring use case too - a Spark pipeline can be
>>>> saved and then reloaded, but you need all of Spark's dependencies in your
>>>> serving app, which is often not ideal. If you're doing bulk scoring offline,
>>>> then it may suit.
>>>>
>>>> Honestly though, for that data size I'd certainly go with something
>>>> like Liblinear :) Spark will ultimately scale better with # training
>>>> examples for very large-scale problems. However, there are definitely
>>>> limitations on model dimension and sparse weight vectors currently. There
>>>> are potential solutions to these but they haven't been implemented as yet.
>>>>
>>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
>>>> daniel.siegm...@teamaol.com> wrote:
>>>>
>>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
>>>>> nick.pentre...@gmail.com> wrote:
>>>>>
>>>>>> Would you mind letting us know the # training examples in the
>>>>>> datasets? Also, what do your features look like? Are they text,
>>>>>> categorical, etc.? You mention that most rows only have a few features,
>>>>>> and all rows together have a few 10,000s of features, yet your max
>>>>>> feature value is 20 million. How are you constructing your feature
>>>>>> vectors to get a size of 20 million? The only realistic way I can see
>>>>>> this situation occurring in practice is with feature hashing
>>>>>> (HashingTF).
>>>>>>
>>>>>
>>>>> The sub-sample I'm currently training on is about 50K rows, so ...
>>>>> small.
>>>>>
>>>>> The features causing this issue are numeric (int) IDs for ... let's
>>>>> call it "Thing". For each Thing in the record, we set the feature
>>>>> Thing.id to a value of 1.0 in our vector (which is of course a
>>>>> SparseVector). I'm not sure how IDs are generated for Things, but
>>>>> they can be large numbers.
>>>>>
>>>>> The largest Thing ID is around 20 million, so that ends up being the
>>>>> size of the vector. But in fact there are fewer than 10,000 unique Thing
>>>>> IDs in this data. The mean number of features per record in what I'm
>>>>> currently training against is 41, while the maximum for any given record
>>>>> was 1754.
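>>>>>
>>>>> To make that concrete, the vectors are built roughly like this
>>>>> (simplified, with placeholder names; thingIds holds the IDs for one
>>>>> record):
>>>>>
>>>>>   import org.apache.spark.mllib.linalg.Vectors
>>>>>
>>>>>   val vectorSize = 20000000                // ~ largest Thing ID + 1
>>>>>   val indices = thingIds.distinct.sorted   // thingIds: Array[Int]
>>>>>   val values = Array.fill(indices.length)(1.0)
>>>>>   val features = Vectors.sparse(vectorSize, indices, values)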
>>>>>
>>>>> It is possible to map the features into a small set (just need to
>>>>> zipWithIndex), but this is undesirable because of the added complexity
>>>>> (not just for the training, but also for anything wanting to score
>>>>> against the model). It might be a little easier if this could be
>>>>> encapsulated within the model object itself (perhaps via composition),
>>>>> though I'm not sure how feasible that is.
>>>>>
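>>>>> For what it's worth, the remapping itself would be simple enough -
>>>>> something like this (untested, with illustrative names):
>>>>>
>>>>>   // records: RDD of input records, each carrying its Thing IDs
>>>>>   val idToIndex = records
>>>>>     .flatMap(_.thingIds)
>>>>>     .distinct()
>>>>>     .zipWithIndex()            // raw Thing ID -> compact index (Long)
>>>>>     .collectAsMap()
>>>>>
>>>>>   val numFeatures = idToIndex.size
>>>>>   // each record's vector then becomes Vectors.sparse(numFeatures,
>>>>>   //   record.thingIds.map(id => idToIndex(id).toInt).sorted, values)
>>>>>
>>>>> The real cost is that the idToIndex map has to be persisted and shared
>>>>> with every system that scores against the model.
>>>>>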
>>>>> But I'd rather not bother with dimensionality reduction at all - since
>>>>> we can train using liblinear in just a few minutes, it doesn't seem
>>>>> necessary.
>>>>>
>>>>>
>>>>>>
>>>>>> MultivariateOnlineSummarizer uses dense arrays, but it should be
>>>>>> possible to enable sparse data. Though in theory, the result will tend to
>>>>>> be dense anyway, unless you have very many entries in the input feature
>>>>>> vector that never occur and are actually zero throughout the data set
>>>>>> (which it seems is the case with your data?). So I doubt whether using
>>>>>> sparse vectors for the summarizer would improve performance in general.
>>>>>>
>>>>>
>>>>> Yes, that is exactly my case - the vast majority of entries in the
>>>>> input feature vector will *never* occur. Presumably that means most
>>>>> of the values in the aggregators' arrays will be zero.
>>>>>
>>>>>
>>>>>>
>>>>>> LR doesn't accept a sparse weight vector, as it uses dense vectors
>>>>>> for coefficients and gradients currently. When using L1 regularization,
>>>>>> it could support sparse weight vectors, but the current implementation
>>>>>> doesn't do that yet.
>>>>>>
>>>>>
>>>>> Good to know it is theoretically possible to implement. I'll have to
>>>>> give it some thought. In the meantime I guess I'll experiment with
>>>>> coalescing the data to minimize the communication overhead.
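>>>>>
>>>>> Something along these lines, I expect (trainingData and the partition
>>>>> count are just placeholders):
>>>>>
>>>>>   // fewer partitions -> fewer tasks and less aggregation overhead
>>>>>   val coalesced = trainingData.coalesce(16)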
>>>>>
>>>>> Thanks again.
>>>>>
>>>>
>>>
