MLlib Logistic Regression performance relative to Mahout

2016-02-26 Thread raj.kumar
Hi,

We are trying to port some code that uses Mahout Logistic Regression over to
MLlib Logistic Regression, and our preliminary tests indicate a performance
bottleneck. It is not clear to me whether this is due to one of three
factors:

o Comparing apples to oranges
o Inadequate tuning
o Insufficient parallelism

The test results and the code that produced them are below. I am hoping that
someone can shed some light on the performance problem we are having.

thanks much
-Raj

P.S. Apologies if this is a duplicate posting. A response to a previous
posting suggested that it may not have registered correctly.

- Mahout LR vs. MLlib LR -
Data      Cluster      MLlib                  Mahout
size      type         Train  Test  Rate     Train  Test  Rate
-------   ----------   -----  ----  ----     -----  ----  ----
100       local[*]     .03.154  1.111  100
100       Cluster[6]   .036   .09  59  1   9100
500,000   local[*]     32  983  326   1086   82
500,000   Cluster[6]   8   483  310   877 81

Notes:
o All rates are in records/millisecond.
o The 100-record dataset is the sample_libsvm_data.txt file.
o The cluster was a set of 6 worker machines on AWS.
o Rate indicates the % of the test set that was labeled correctly.
o The latest versions of MLlib (1.6) and Mahout (0.9) were used in the tests.
 
MllMahout.scala
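For readers without the attachment, a rough sketch of how such an MLlib
benchmark can be written (illustrative only, not the exact MllMahout.scala
code behind the numbers above; the LBFGS solver, the 80/20 split, the seed,
and the data path are all my assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.util.MLUtils

object LRBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LRBenchmarkSketch"))

    // Load libsvm-formatted data and split it into train/test sets.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    train.cache()

    // Time training.
    val t0 = System.currentTimeMillis()
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)
    val trainMs = System.currentTimeMillis() - t0

    // Time scoring and compute the fraction labeled correctly.
    val t1 = System.currentTimeMillis()
    val predAndLabel = test.map(p => (model.predict(p.features), p.label))
    val accuracy = new MulticlassMetrics(predAndLabel).precision
    val testMs = System.currentTimeMillis() - t1

    println(s"train: $trainMs ms, test: $testMs ms, accuracy: $accuracy")
    sc.stop()
  }
}
```

On a small dataset like sample_libsvm_data.txt, the fixed per-job scheduling
overhead can dominate, which is one reason the apples-to-oranges concern
above is worth ruling out before tuning.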

  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-Logistic-Regression-performance-relative-to-Mahout-tp26346.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Saving and Loading DataFrames

2016-02-25 Thread raj.kumar
Hi,

I am using MLlib. I use the ml vectorization tools to create the vectorized
input DataFrame for the ml/mllib machine-learning models, with schema:
 
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

To avoid repeated vectorization, I am trying to save and load this DataFrame
using:

  df.write.format("json").mode("overwrite").save(url)
  val data = Spark.sqlc.read.format("json").load(url)

However, when I load the DataFrame, the newly loaded DataFrame has the
following schema:

root
 |-- features: struct (nullable = true)
 |    |-- indices: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- size: long (nullable = true)
 |    |-- type: long (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |-- label: double (nullable = true)

which the machine-learning models do not recognize.

Is there a way I can save and load this DataFrame without the schema
changing? I assume it has to do with the fact that Vector is not a basic
type.
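One workaround (a sketch, assuming Spark 1.6 with the standard sqlContext):
JSON has no notion of Spark SQL user-defined types, so a VectorUDT column is
written out as a plain struct of its fields. Parquet, by contrast, stores
the UDT metadata and should round-trip the vector column intact:

```scala
// Parquet (unlike JSON) preserves Spark SQL user-defined types such as
// VectorUDT, so "features" is read back as a vector, not a struct.
df.write.mode("overwrite").parquet(url)
val data = sqlContext.read.parquet(url)
data.printSchema()
// expected:
// root
//  |-- label: double (nullable = true)
//  |-- features: vector (nullable = true)
```

The loaded DataFrame should then be accepted by the ml/mllib models without
re-vectorizing.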

thanks
-Raj





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Saving-and-Loading-Dataframes-tp26339.html



Dataset Encoders for SparseVector

2016-02-04 Thread raj.kumar
Hi, 

I have a DataFrame df with a column "feature" of type SparseVector that
results from the ml library's VectorAssembler class. 

I'd like to get a Dataset of SparseVectors from this column, but when I do

  df.as[SparseVector]

Scala complains that it doesn't know of an encoder for SparseVector. If I
then try to implement the Encoder[T] interface for SparseVector, I get the
error:

  "java.lang.RuntimeException: Only expression encoders are supported today"

How can I get a Dataset[SparseVector] from the output of VectorAssembler?
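One workaround (a sketch, assuming Spark 1.6, mllib's SparseVector, and that
the assembled vectors are in fact sparse): use a kryo-based encoder. Kryo
serialization stores each vector as an opaque binary blob, so columnar
access to the vector's fields is lost, but it gets past the
expression-encoder limitation:

```scala
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.sql.{Dataset, Encoders}

// A kryo-based encoder sidesteps the lack of an expression encoder
// for SparseVector, at the cost of an opaque binary representation.
implicit val sparseVectorEncoder = Encoders.kryo[SparseVector]

// Pull the column out as an RDD of SparseVector, then re-wrap it as a
// Dataset using the implicit kryo encoder above.
val vectorsRdd = df.select("feature").rdd.map(_.getAs[SparseVector](0))
val vectors: Dataset[SparseVector] = sqlContext.createDataset(vectorsRdd)
```

Note the getAs cast will fail at runtime for any row whose assembled vector
is dense rather than sparse.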



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dataset-Encoders-for-SparseVector-tp26149.html