Hm, this means these vectors carry around a second representation, which is probably too costly from a memory perspective. Can the caller not just construct these as needed? While constructing one takes time, that still seems like a win, and it doesn't impact the rest of the code.
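A minimal sketch of the caller-side alternative suggested above: build a constant-time lookup on demand from a sparse vector's parallel arrays, right before the scoring loop, instead of storing a second representation inside every vector. This is illustrative plain Scala, not Spark or Breeze code; the `toLookup` helper name is hypothetical.

```scala
object OnDemandLookup {
  // Hypothetical helper: given the sorted, distinct index array and the
  // matching values array of a sparse vector (as in
  // org.apache.spark.ml.linalg.SparseVector), build an O(1)-access map.
  // Construction is O(numNonZero); do it once per vector before the
  // repeated random accesses, and let it be garbage-collected afterwards.
  def toLookup(indices: Array[Int], values: Array[Double]): Map[Int, Double] =
    indices.zip(values).toMap.withDefaultValue(0.0) // absent index => 0.0
}
```

Usage: `val m = OnDemandLookup.toLookup(sv.indices, sv.values)` once, then `m(i)` inside the prediction loop, so the extra memory lives only as long as one prediction.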
On Wed, Jul 4, 2018 at 9:26 AM Vincent Wang <fvunic...@gmail.com> wrote:

> Hi Sean,
>
> I think the simplest way is to return a breeze.linalg.HashVector when
> org.apache.spark.ml.linalg.SparseVector#asBreeze is called, and use a
> lazy value to store that vector, because the construction of
> breeze.linalg.HashVector has some extra performance cost.
>
> The code will be like:
>
>   class SparseVector @Since("2.0.0") (
>       override val size: Int,
>       @Since("2.0.0") val indices: Array[Int],
>       @Since("2.0.0") val values: Array[Double]) extends Vector {
>
>     lazy val breezeVector =
>       breeze.linalg.HashVector.apply(size)(indices.zip(values): _*)
>
>     .....
>
>     private[spark] override def asBreeze: BV[Double] = breezeVector
>
>     ....
>   }
>
> I'm not sure this is the best approach, so I think I can file an issue
> for further discussion.
>
> Thanks,
> Huafeng
>
> On Mon, Jul 2, 2018 at 11:38 AM Sean Owen <sro...@gmail.com> wrote:
>
>> This could be a good optimization. But can it be done without changing
>> any APIs or slowing anything else down? If so, this could be worth a
>> pull request.
>>
>> On Sun, Jul 1, 2018 at 9:21 PM Vincent Wang <fvunic...@gmail.com> wrote:
>> >
>> > Hi there,
>> >
>> > I'm using GBTClassifier for some classification jobs and find the
>> > performance of the scoring stage not quite satisfying. The trained
>> > model has about 160 trees, and the input feature vector is sparse with
>> > a size of about 20+.
>> >
>> > After some digging, I found the model repeatedly and randomly accesses
>> > features in the SparseVector when predicting an input vector, which
>> > eventually calls breeze.linalg.SparseVector#apply. That function
>> > generally uses a binary search to locate the corresponding index, so
>> > its complexity is O(log numNonZero).
>> >
>> > Then I tried converting my feature vectors to dense vectors before
>> > inference, and the result shows that the inference stage speeds up by
>> > about 2-3x. (Random access in a DenseVector is O(1).)
>> >
>> > So my question is: why not use breeze.linalg.HashVector when randomly
>> > accessing values in a SparseVector, since its access complexity is
>> > O(1) according to Breeze's documentation, much better than
>> > SparseVector in this case?
>> >
>> > Thanks,
>> > Vincent
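The access-cost difference the thread is about can be sketched in plain Scala (this is an illustration of the two strategies, not Breeze's actual source): a sparse vector backed by sorted parallel arrays answers `apply(i)` with a binary search over the index array, O(log numNonZero), while a hash-backed vector answers with a single hash lookup, O(1).

```scala
object AccessCost {
  // Mimics the strategy of breeze.linalg.SparseVector#apply:
  // binary search over the sorted index array, O(log numNonZero).
  def sparseApply(indices: Array[Int], values: Array[Double], i: Int): Double = {
    val pos = java.util.Arrays.binarySearch(indices, i)
    if (pos >= 0) values(pos) else 0.0 // negative pos => index not stored
  }

  // Mimics the strategy of a hash-backed vector such as
  // breeze.linalg.HashVector: one expected-O(1) hash lookup per access.
  def hashApply(m: scala.collection.Map[Int, Double], i: Int): Double =
    m.getOrElse(i, 0.0)
}
```

With ~160 trees each probing the same small vector many times per prediction, the per-access constant dominates, which is why the dense (O(1)) conversion above helped and why a HashVector should help similarly without densifying.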