Also adding the dev list in case anyone else has ideas or views.

On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Thanks for the feedback.
>
> I think Spark can certainly meet your use case as your data size scales
> up, since the actual model dimension is very small - but you will need to
> use those indexers or some other mapping mechanism.
>
> There is ongoing work for Spark 2.0 to make it easier to use models
> outside of Spark - see also PMML export (I think mllib logistic regression
> is supported, but I'd have to check that). That will help with using Spark
> models in serving environments.
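>
> As a rough sketch (assuming binary mllib logistic regression is indeed
> covered by PMML export - I'd need to verify), the export would look
> something like this:
>
>   import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
>
>   // `training` is a hypothetical RDD[LabeledPoint] of the training data
>   val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
>   // write PMML to a local file for use in a non-Spark serving environment
>   model.toPMML("/tmp/lr-model.pmml")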
>
> Finally, I will add a JIRA to investigate sparse models for LR - and maybe
> also a ticket for the multivariate summariser (though I don't think there
> will be much to gain in practice).
>
>
> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <daniel.siegm...@teamaol.com>
> wrote:
>
>> Thanks for the pointer to those indexers - those are some good examples,
>> and a good way to go for the trainer and any scoring done in Spark. I
>> will definitely have to deal with scoring in non-Spark systems though.
>>
>> I think I will need to scale up beyond what single-node liblinear can
>> practically provide. The system will need to handle much larger sub-samples
>> of this data (and other projects might be larger still). Additionally, the
>> system needs to train many models in parallel (hyper-parameter optimization
>> with n-fold cross-validation, multiple algorithms, different sets of
>> features).
>>
>> Still, I suppose we'll have to consider whether Spark is the best system
>> for this. For now though, my job is to see what can be achieved with Spark.
>>
>>
>>
>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> Ok, I think I understand things better now.
>>>
>>> For Spark's current implementation, you would need to map those features
>>> as you mention. You could also use, say, StringIndexer -> OneHotEncoder
>>> or VectorIndexer. You could create a Pipeline to deal with the mapping
>>> and training (e.g.
>>> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
>>> Pipelines support persistence.
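>>>
>>> A minimal sketch of such a pipeline (assuming a DataFrame `df` with a
>>> categorical "thingId" column and a "label" column - the names are just
>>> illustrative):
>>>
>>>   import org.apache.spark.ml.Pipeline
>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>   import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
>>>
>>>   val indexer = new StringIndexer().setInputCol("thingId").setOutputCol("thingIndex")
>>>   val encoder = new OneHotEncoder().setInputCol("thingIndex").setOutputCol("features")
>>>   val lr = new LogisticRegression()
>>>
>>>   val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
>>>   val model = pipeline.fit(df)
>>>   // ML persistence (Spark 1.6+)
>>>   model.write.overwrite().save("/tmp/lr-pipeline")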
>>>
>>> But it depends on your scoring use case too - a Spark pipeline can be
>>> saved and then reloaded, but you need all of Spark's dependencies in your
>>> serving app, which is often not ideal. If you're doing bulk scoring
>>> offline, then it may suit.
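>>>
>>> For offline bulk scoring, the reload side is just (path as above):
>>>
>>>   import org.apache.spark.ml.PipelineModel
>>>
>>>   val model = PipelineModel.load("/tmp/lr-pipeline")
>>>   val scored = model.transform(newData)  // `newData` needs the same input columns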
>>>
>>> Honestly though, for that data size I'd certainly go with something like
>>> Liblinear :) Spark will ultimately scale better with the number of
>>> training examples for very large scale problems. However, there are
>>> definite limitations on model dimension and sparse weight vectors
>>> currently. There are potential solutions to these, but they haven't been
>>> implemented yet.
>>>
>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
>>> daniel.siegm...@teamaol.com> wrote:
>>>
>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Would you mind letting us know the # training examples in the
>>>>> datasets? Also, what do your features look like? Are they text,
>>>>> categorical, etc.? You mention that most rows only have a few features,
>>>>> and all rows together have a few 10,000s of features, yet your max
>>>>> feature value is 20 million. How are you constructing your feature
>>>>> vectors to get a 20 million size? The only realistic way I can see this
>>>>> situation occurring in practice is with feature hashing (HashingTF).
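>>>>>
>>>>> For reference, feature hashing fixes the vector size up front no matter
>>>>> how many distinct terms exist - a quick illustration (mllib API, with
>>>>> made-up terms):
>>>>>
>>>>>   import org.apache.spark.mllib.feature.HashingTF
>>>>>
>>>>>   // every input term is hashed into a fixed 2^20-dimensional space
>>>>>   val tf = new HashingTF(numFeatures = 1 << 20)
>>>>>   val vector = tf.transform(Seq("thingA", "thingB"))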
>>>>>
>>>>
>>>> The sub-sample I'm currently training on is about 50K rows, so ...
>>>> small.
>>>>
>>>> The features causing this issue are numeric (int) IDs for ... let's
>>>> call it "Thing". For each Thing in the record, we set the feature
>>>> Thing.id to a value of 1.0 in our vector (which is of course a
>>>> SparseVector). I'm not sure how IDs are generated for Things, but they
>>>> can be large numbers.
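>>>>
>>>> Roughly, the construction looks like this (the IDs and the 20 million
>>>> bound are illustrative):
>>>>
>>>>   import org.apache.spark.mllib.linalg.Vectors
>>>>
>>>>   val vectorSize = 20000000  // sized by the largest possible Thing ID
>>>>   val thingIds = Seq(17, 4023, 19999999)  // the Things in one record
>>>>   // one 1.0 entry per Thing present; everything else is implicitly zero
>>>>   val features = Vectors.sparse(vectorSize, thingIds.map(id => (id, 1.0)))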
>>>>
>>>> The largest Thing ID is around 20 million, so that ends up being the
>>>> size of the vector. But in fact there are fewer than 10,000 unique Thing
>>>> IDs in this data. The mean number of features per record in what I'm
>>>> currently training against is 41, while the maximum for any given record
>>>> was 1754.
>>>>
>>>> It is possible to map the features into a small set (we'd just need to
>>>> zipWithIndex), but this is undesirable because of the added complexity
>>>> (not just for the training, but also for anything wanting to score
>>>> against the model). It might be a little easier if this could be
>>>> encapsulated within the model object itself (perhaps via composition),
>>>> though I'm not sure how feasible that is.
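>>>>
>>>> A sketch of what that remapping would involve (`records` here is a
>>>> hypothetical RDD of (label, Seq[thingId]) pairs):
>>>>
>>>>   import org.apache.spark.mllib.linalg.Vectors
>>>>   import org.apache.spark.mllib.regression.LabeledPoint
>>>>
>>>>   // build a dense index over the ~10K distinct IDs actually seen
>>>>   val idToIndex = records.flatMap(_._2).distinct().zipWithIndex()
>>>>     .mapValues(_.toInt).collectAsMap()
>>>>   // rewrite each record's vector in the compact space; note this map
>>>>   // must also ship with the model to anything scoring against it
>>>>   val points = records.map { case (label, ids) =>
>>>>     LabeledPoint(label,
>>>>       Vectors.sparse(idToIndex.size, ids.map(id => (idToIndex(id), 1.0))))
>>>>   }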
>>>>
>>>> But I'd rather not bother with dimensionality reduction at all - since
>>>> we can train using liblinear in just a few minutes, it doesn't seem
>>>> necessary.
>>>>
>>>>
>>>>>
>>>>> MultivariateOnlineSummarizer uses dense arrays, but it should be
>>>>> possible to enable sparse data. In theory though, the result will tend
>>>>> to be dense anyway, unless you have very many entries in the input
>>>>> feature vector that never occur and are actually zero throughout the
>>>>> data set (which it seems is the case with your data?). So I doubt
>>>>> whether using sparse vectors for the summarizer would improve
>>>>> performance in general.
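>>>>>
>>>>> For context, the usual aggregation pattern is below; each per-partition
>>>>> summarizer allocates dense arrays of the full vector dimension, which
>>>>> is where the cost comes from (`vectors` is a hypothetical RDD[Vector]):
>>>>>
>>>>>   import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
>>>>>
>>>>>   // one summarizer per partition, merged pairwise on the way up
>>>>>   val summary = vectors.treeAggregate(new MultivariateOnlineSummarizer)(
>>>>>     (summarizer, v) => summarizer.add(v),
>>>>>     (s1, s2) => s1.merge(s2))
>>>>>   println(summary.mean)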
>>>>>
>>>>
>>>> Yes, that is exactly my case - the vast majority of entries in the
>>>> input feature vector will *never* occur. Presumably that means most of
>>>> the values in the aggregators' arrays will be zero.
>>>>
>>>>
>>>>>
>>>>> LR doesn't accept a sparse weight vector, as it currently uses dense
>>>>> vectors for coefficients and gradients. When using L1 regularization,
>>>>> it could support sparse weight vectors, but the current implementation
>>>>> doesn't do that yet.
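>>>>>
>>>>> To be clear, L1 itself is already selectable via the elastic-net mixing
>>>>> parameter in ml.LogisticRegression - it's only the coefficient storage
>>>>> that stays dense (the regParam value is illustrative):
>>>>>
>>>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>>>
>>>>>   val lr = new LogisticRegression()
>>>>>     .setRegParam(0.01)        // overall regularization strength
>>>>>     .setElasticNetParam(1.0)  // 1.0 = pure L1, 0.0 = pure L2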
>>>>>
>>>>
>>>> Good to know it is theoretically possible to implement. I'll have to
>>>> give it some thought. In the meantime I guess I'll experiment with
>>>> coalescing the data to minimize the communication overhead.
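>>>>
>>>> Something along these lines, with the partition count as the knob to
>>>> tune (`points` being the training RDD; the number is just a guess):
>>>>
>>>>   // fewer partitions means fewer tasks per treeAggregate round,
>>>>   // at the cost of less parallelism
>>>>   val coalesced = points.coalesce(16).cache()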
>>>>
>>>> Thanks again.
>>>>
>>>
>>
