Re: mllib vector templates

Xiangrui Meng Fri, 16 May 2014 16:46:33 -0700

3) It is not designed for dense feature vectors.
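
To make the format concerns in this thread concrete, here is a minimal Scala sketch that writes the same sparse vector two ways. The names (`VectorTextFormats`, `toLibSVM`, `toGrouped`) are made up for this illustration and are not MLlib APIs. The grouped layout is only an assumption about what a format carrying the dimension could look like; it is not claimed to be the format from the PR.

```scala
object VectorTextFormats {
  // Hypothetical helper: LibSVM-style record, "label index:value ...",
  // using 1-based indices. The vector dimension is not part of the record.
  def toLibSVM(label: Double, indices: Array[Int], values: Array[Double]): String = {
    val pairs = indices.zip(values).map { case (i, v) => s"${i + 1}:$v" }
    (label.toString +: pairs).mkString(" ")
  }

  // Hypothetical helper: the size comes first, then all indices together,
  // then all values together, so each group can compress on its own.
  def toGrouped(size: Int, indices: Array[Int], values: Array[Double]): String =
    s"($size,${indices.mkString("[", ",", "]")},${values.mkString("[", ",", "]")})"
}

// A vector of size 5 with non-zeros at positions 0 and 3:
println(VectorTextFormats.toLibSVM(1.0, Array(0, 3), Array(1.0, 2.5)))
// prints "1.0 1:1.0 4:2.5"
println(VectorTextFormats.toGrouped(5, Array(0, 3), Array(1.0, 2.5)))
// prints "(5,[0,3],[1.0,2.5])"
```

In the LibSVM line the size 5 cannot be recovered from the record alone, which is why a scan over the dataset is needed to learn the feature dimension.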


On Thu, May 15, 2014 at 8:33 PM, Xiangrui Meng <[email protected]> wrote:
> I submitted a PR for standardizing the text format for vectors and
> labeled data: https://github.com/apache/spark/pull/685
>
> Once it gets merged, saveAsTextFile and loading should be consistent.
> I didn't choose LibSVM as the default format for two reasons:
>
> 1) It doesn't contain feature dimension info in the record. We need to
> scan the dataset to get that info.
> 2) It saves index:value tuples. Putting indices together can help data
> compression. The same holds for values if there are many binary features.
>
> Best,
> Xiangrui
>
> On Wed, May 7, 2014 at 10:25 PM, Debasish Das <[email protected]> wrote:
>> Hi,
>>
>> I see ALS is still using Array[Int], but for the other mllib algorithms we moved
>> to Vector[Double] so that they can support both dense and sparse formats...
>>
>> ALS can stay in Array[Int] due to the Netflix format for input datasets
>> which is well defined but it helps if we move ALS to Vector[Double] as
>> well...that way all algorithms will be consistent...
>>
>> The second issue is that toString on SparseVector does not write libsvm
>> format but something less generic. Can we change
>> SparseVector.toString to write libsvm output? I am dumping a sample of the
>> dataset to see how mllib glm compares with the glmnet-R package for QoR...
>>
>> Thanks.
>> Deb
>>
>> On Mon, May 5, 2014 at 4:05 PM, David Hall <[email protected]> wrote:
>>>
>>>> On Mon, May 5, 2014 at 3:40 PM, DB Tsai <[email protected]> wrote:
>>>>
>>>> > David,
>>>> >
>>>> > Could we use Int, Long, Float as the data feature spaces, and Double for
>>>> > optimizer?
>>>> >
>>>>
>>>> Yes. Breeze doesn't allow operations on mixed types, so you'd need to
>>>> convert the double vectors to Floats if you wanted to, e.g., take a dot
>>>> product with the weights vector.
>>>>
>>>> You might also be interested in FeatureVector, which is just a wrapper
>>>> around Array[Int] that emulates an indicator vector. It supports dot
>>>> products, axpy, etc.
>>>>
>>>> -- David
>>>>
>>>>
>>>> >
>>>> >
>>>> > Sincerely,
>>>> >
>>>> > DB Tsai
>>>> > -------------------------------------------------------
>>>> > My Blog: https://www.dbtsai.com
>>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >
>>>> >
>>>> > On Mon, May 5, 2014 at 3:06 PM, David Hall <[email protected]> wrote:
>>>> >
>>>> > > Lbfgs and other optimizers would not work immediately, as they require
>>>> > > vector spaces over double. Otherwise it should work.
>>>> > > On May 5, 2014 3:03 PM, "DB Tsai" <[email protected]> wrote:
>>>> > >
>>>> > > > Breeze could take any type (Int, Long, Double, and Float) in the matrix
>>>> > > > template.
>>>> > > >
>>>> > > >
>>>> > > > Sincerely,
>>>> > > >
>>>> > > > DB Tsai
>>>> > > > -------------------------------------------------------
>>>> > > > My Blog: https://www.dbtsai.com
>>>> > > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> > > >
>>>> > > >
>>>> > > > On Mon, May 5, 2014 at 2:56 PM, Debasish Das <[email protected]> wrote:
>>>> > > >
>>>> > > > > Is this a Breeze issue, or can Breeze take templates on float/double?
>>>> > > > >
>>>> > > > > If Breeze can take templates, then it is a minor fix for Vectors.scala,
>>>> > > > > right?
>>>> > > > >
>>>> > > > > Thanks.
>>>> > > > > Deb
>>>> > > > >
>>>> > > > >
>>>> > > > > On Mon, May 5, 2014 at 2:45 PM, DB Tsai <[email protected]> wrote:
>>>> > > > >
>>>> > > > > > +1. It would be nice if we could use different types in Vector.
>>>> > > > > >
>>>> > > > > >
>>>> > > > > > Sincerely,
>>>> > > > > >
>>>> > > > > > DB Tsai
>>>> > > > > > -------------------------------------------------------
>>>> > > > > > My Blog: https://www.dbtsai.com
>>>> > > > > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> > > > > >
>>>> > > > > >
>>>> > > > > > On Mon, May 5, 2014 at 2:41 PM, Debasish Das <[email protected]> wrote:
>>>> > > > > >
>>>> > > > > > > Hi,
>>>> > > > > > >
>>>> > > > > > > Why is the mllib vector using double as the default?
>>>> > > > > > >
>>>> > > > > > > /**
>>>> > > > > > >  * Represents a numeric vector, whose index type is Int and value type is
>>>> > > > > > >  * Double.
>>>> > > > > > >  */
>>>> > > > > > > trait Vector extends Serializable {
>>>> > > > > > >
>>>> > > > > > >   /**
>>>> > > > > > >    * Size of the vector.
>>>> > > > > > >    */
>>>> > > > > > >   def size: Int
>>>> > > > > > >
>>>> > > > > > >   /**
>>>> > > > > > >    * Converts the instance to a double array.
>>>> > > > > > >    */
>>>> > > > > > >   def toArray: Array[Double]
>>>> > > > > > >
>>>> > > > > > > Don't we need a template on float/double? This will give us memory
>>>> > > > > > > savings...
>>>> > > > > > >
>>>> > > > > > > Thanks.
>>>> > > > > > >
>>>> > > > > > > Deb
>>>> > > > > > >
>>>> > > > > >
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>>
>>>
>>>
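
The Float/Double trade-off raised in the quoted thread can be sketched in plain Scala, without Breeze. This is only an illustration under stated assumptions: `MixedPrecisionDot`, `dot`, and `indicatorDot` are hypothetical names, not Breeze or MLlib APIs. Features are kept as Float (4 bytes per value instead of 8) and widened to Double only when multiplied with Double weights. The second function mimics the indicator-vector idea, where the features are just a list of active indices.

```scala
object MixedPrecisionDot {
  // Features stored as Float, weights as Double. Each feature is widened to
  // Double at the point of use, so no Double copy of the features is allocated.
  def dot(features: Array[Float], weights: Array[Double]): Double = {
    require(features.length == weights.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < features.length) {
      sum += features(i).toDouble * weights(i)
      i += 1
    }
    sum
  }

  // Indicator vector stored as its active indices only: every active feature
  // has value 1, so the dot product is the sum of the weights at those indices.
  def indicatorDot(activeIndices: Array[Int], weights: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < activeIndices.length) {
      sum += weights(activeIndices(i))
      i += 1
    }
    sum
  }
}

println(MixedPrecisionDot.dot(Array(1.0f, 2.0f), Array(0.5, 0.25)))
// prints "1.0"
println(MixedPrecisionDot.indicatorDot(Array(0, 2), Array(0.5, 1.0, 0.25)))
// prints "0.75"
```

The optimizer state (the weights) stays in Double, matching the point that optimizers such as L-BFGS require vector spaces over double.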
