3) It is not designed for dense feature vectors.
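
To illustrate the three points, here is a minimal sketch in plain Scala (no Spark dependency). The helper names `toLibSVM` and `toDenseText` are made up for this example, and the bracketed dense layout is only an assumption for comparison, not necessarily what the PR emits.

```scala
object LibSVMSketch {
  // Hypothetical helper: format a labeled vector as one LibSVM record.
  // LibSVM uses 1-based indices and omits zero entries.
  def toLibSVM(label: Double, values: Array[Double]): String = {
    val pairs = values.zipWithIndex.collect {
      case (v, i) if v != 0.0 => s"${i + 1}:$v"
    }
    (label.toString +: pairs).mkString(" ")
  }

  // Hypothetical dense layout for comparison: values only, no indices.
  def toDenseText(label: Double, values: Array[Double]): String =
    label.toString + " " + values.mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    val dense = Array(0.5, 1.5, 2.5, 0.0)
    println(toLibSVM(1.0, dense))    // 1.0 1:0.5 2:1.5 3:2.5
    println(toDenseText(1.0, dense)) // 1.0 [0.5,1.5,2.5,0.0]
  }
}
```

The LibSVM record drops the trailing zero, so a reader cannot tell from that line alone that the vector has four dimensions. For a dense vector, every value also carries an index that adds no information.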

On Thu, May 15, 2014 at 8:33 PM, Xiangrui Meng <men...@gmail.com> wrote:
> I submitted a PR for standardizing the text format for vectors and
> labeled data: https://github.com/apache/spark/pull/685
>
> Once it gets merged, saveAsTextFile and loading should be consistent.
> I didn't choose LibSVM as the default format because two reasons:
>
> 1) It doesn't contain feature dimension info in the record. We need to
> scan the dataset to get that info.
> 2) It saves index:value tuples. Putting indices together can help data
> compression. Same for value if there are many binary features.
>
> Best,
> Xiangrui
>
> On Wed, May 7, 2014 at 10:25 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>> Hi,
>>
>> I see ALS is still using Array[Int] but for other mllib algorithm we moved
>> to Vector[Double] so that it can support either dense and sparse formats...
>>
>> ALS can stay in Array[Int] due to the Netflix format for input datasets
>> which is well defined but it helps if we move ALS to Vector[Double] as
>> well...that way all algorithms will be consistent...
>>
>> The second issue is that toString on SparseVector does not write libsvm
>> format but something not very generic...can we change the
>> SparseVector.toString to write as libsvm output ? I am dumping a sample of
>> dataset to see how mllib glm compares with the glmnet-R package for QoR...
>>
>> Thanks.
>> Deb
>>
>> On Mon, May 5, 2014 at 4:05 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>>
>>>> On Mon, May 5, 2014 at 3:40 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>
>>>> > David,
>>>> >
>>>> > Could we use Int, Long, Float as the data feature spaces, and Double for
>>>> > optimizer?
>>>> >
>>>>
>>>> Yes. Breeze doesn't allow operations on mixed types, so you'd need to
>>>> convert the double vectors to Floats if you wanted, e.g. dot product with
>>>> the weights vector.
>>>>
>>>> You might also be interested in FeatureVector, which is just a wrapper
>>>> around Array[Int] that emulates an indicator vector. It supports dot
>>>> products, axpy, etc.
>>>>
>>>> -- David
>>>>
>>>> >
>>>> > Sincerely,
>>>> >
>>>> > DB Tsai
>>>> > -------------------------------------------------------
>>>> > My Blog: https://www.dbtsai.com
>>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >
>>>> >
>>>> > On Mon, May 5, 2014 at 3:06 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>>> >
>>>> > > Lbfgs and other optimizers would not work immediately, as they require
>>>> > > vector spaces over double. Otherwise it should work.
>>>> > > On May 5, 2014 3:03 PM, "DB Tsai" <dbt...@stanford.edu> wrote:
>>>> > >
>>>> > > > Breeze could take any type (Int, Long, Double, and Float) in the matrix
>>>> > > > template.
>>>> > > >
>>>> > > >
>>>> > > > Sincerely,
>>>> > > >
>>>> > > > DB Tsai
>>>> > > > -------------------------------------------------------
>>>> > > > My Blog: https://www.dbtsai.com
>>>> > > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> > > >
>>>> > > >
>>>> > > > On Mon, May 5, 2014 at 2:56 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>> > > >
>>>> > > > > Is this a breeze issue or breeze can take templates on float / double ?
>>>> > > > >
>>>> > > > > If breeze can take templates then it is a minor fix for Vectors.scala right ?
>>>> > > > >
>>>> > > > > Thanks.
>>>> > > > > Deb
>>>> > > > >
>>>> > > > >
>>>> > > > > On Mon, May 5, 2014 at 2:45 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>> > > > >
>>>> > > > > > +1 Would be nice that we can use different type in Vector.
>>>> > > > > >
>>>> > > > > >
>>>> > > > > > Sincerely,
>>>> > > > > >
>>>> > > > > > DB Tsai
>>>> > > > > > -------------------------------------------------------
>>>> > > > > > My Blog: https://www.dbtsai.com
>>>> > > > > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> > > > > >
>>>> > > > > >
>>>> > > > > > On Mon, May 5, 2014 at 2:41 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>> > > > > >
>>>> > > > > > > Hi,
>>>> > > > > > >
>>>> > > > > > > Why mllib vector is using double as default ?
>>>> > > > > > >
>>>> > > > > > > /**
>>>> > > > > > >  * Represents a numeric vector, whose index type is Int and value type is Double.
>>>> > > > > > >  */
>>>> > > > > > > trait Vector extends Serializable {
>>>> > > > > > >
>>>> > > > > > >   /**
>>>> > > > > > >    * Size of the vector.
>>>> > > > > > >    */
>>>> > > > > > >   def size: Int
>>>> > > > > > >
>>>> > > > > > >   /**
>>>> > > > > > >    * Converts the instance to a double array.
>>>> > > > > > >    */
>>>> > > > > > >   def toArray: Array[Double]
>>>> > > > > > >
>>>> > > > > > > Don't we need a template on float/double ? This will give us memory
>>>> > > > > > > savings...
>>>> > > > > > >
>>>> > > > > > > Thanks.
>>>> > > > > > >
>>>> > > > > > > Deb