Hm, this means these vectors carry around a second representation, which is probably too costly from a memory perspective. Can the caller not just construct these as needed? While constructing one takes time, that still seems like a win, and it doesn't impact the rest of the code.
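A minimal sketch of the caller-side alternative suggested above: build a constant-time lookup on demand from a sparse vector's parallel arrays, right before the scoring loop, instead of storing a second representation inside every vector. This is illustrative plain Scala, not Spark or Breeze code; the `toLookup` helper name is hypothetical.

```scala
object OnDemandLookup {
  // Hypothetical helper: given the sorted, distinct index array and the
  // matching values array of a sparse vector (as in
  // org.apache.spark.ml.linalg.SparseVector), build an O(1)-access map.
  // Construction is O(numNonZero); do it once per vector before the
  // repeated random accesses, and let it be garbage-collected afterwards.
  def toLookup(indices: Array[Int], values: Array[Double]): Map[Int, Double] =
    indices.zip(values).toMap.withDefaultValue(0.0) // absent index => 0.0
}
```

Usage: `val m = OnDemandLookup.toLookup(sv.indices, sv.values)` once, then `m(i)` inside the prediction loop, so the extra memory lives only as long as one prediction.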
On Wed, Jul 4, 2018 at 9:26 AM Vincent Wang <fvunic...@gmail.com> wrote:

> Hi Sean,
>
> I think the simplest way is to return a breeze.linalg.HashVector when
> org.apache.spark.ml.linalg.SparseVector#asBreeze is called, and use a
> lazy value to store that vector, because the construction of
> breeze.linalg.HashVector has some extra performance cost.
>
> The code will be like:
>
>   class SparseVector @Since("2.0.0") (
>       override val size: Int,
>       @Since("2.0.0") val indices: Array[Int],
>       @Since("2.0.0") val values: Array[Double]) extends Vector {
>
>     lazy val breezeVector =
>       breeze.linalg.HashVector.apply(size)(indices.zip(values): _*)
>
>     .....
>
>     private[spark] override def asBreeze: BV[Double] = breezeVector
>
>     ....
>   }
>
> I'm not sure this is the best approach, so I think I can file an issue
> for further discussion.
>
> Thanks,
> Huafeng
>
> On Mon, Jul 2, 2018 at 11:38 AM Sean Owen <sro...@gmail.com> wrote:
>
>> This could be a good optimization. But can it be done without changing
>> any APIs or slowing anything else down? If so, this could be worth a
>> pull request.
>>
>> On Sun, Jul 1, 2018 at 9:21 PM Vincent Wang <fvunic...@gmail.com> wrote:
>> >
>> > Hi there,
>> >
>> > I'm using GBTClassifier for some classification jobs and find the
>> > performance of the scoring stage not quite satisfying. The trained
>> > model has about 160 trees, and the input feature vector is sparse with
>> > a size of about 20+.
>> >
>> > After some digging, I found the model repeatedly and randomly accesses
>> > features in the SparseVector when predicting an input vector, which
>> > eventually calls breeze.linalg.SparseVector#apply. That function
>> > generally uses a binary search to locate the corresponding index, so
>> > its complexity is O(log numNonZero).
>> >
>> > Then I tried converting my feature vectors to dense vectors before
>> > inference, and the result shows that the inference stage speeds up by
>> > about 2-3x. (Random access in a DenseVector is O(1).)
>> >
>> > So my question is: why not use breeze.linalg.HashVector when randomly
>> > accessing values in a SparseVector, since its access complexity is
>> > O(1) according to Breeze's documentation, much better than
>> > SparseVector in this case?
>> >
>> > Thanks,
>> > Vincent
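The access-cost difference the thread is about can be sketched in plain Scala (this is an illustration of the two strategies, not Breeze's actual source): a sparse vector backed by sorted parallel arrays answers `apply(i)` with a binary search over the index array, O(log numNonZero), while a hash-backed vector answers with a single hash lookup, O(1).

```scala
object AccessCost {
  // Mimics the strategy of breeze.linalg.SparseVector#apply:
  // binary search over the sorted index array, O(log numNonZero).
  def sparseApply(indices: Array[Int], values: Array[Double], i: Int): Double = {
    val pos = java.util.Arrays.binarySearch(indices, i)
    if (pos >= 0) values(pos) else 0.0 // negative pos => index not stored
  }

  // Mimics the strategy of a hash-backed vector such as
  // breeze.linalg.HashVector: one expected-O(1) hash lookup per access.
  def hashApply(m: scala.collection.Map[Int, Double], i: Int): Double =
    m.getOrElse(i, 0.0)
}
```

With ~160 trees each probing the same small vector many times per prediction, the per-access constant dominates, which is why the dense (O(1)) conversion above helped and why a HashVector should help similarly without densifying.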