[ https://issues.apache.org/jira/browse/SPARK-7529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562199#comment-14562199 ]

Joseph K. Bradley edited comment on SPARK-7529 at 6/1/15 7:58 PM:
------------------------------------------------------------------

*spark.mllib: Issues found in a pass through the spark.mllib package*

h3. Classification

LogisticRegressionModel + SVMModel
* scala.Option<Object>  getThreshold() *--> Old API. Make Java version?*
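_The root cause across this list: Scala's primitive {{Double}} in a generic position erases to {{Object}}, so Java callers see {{Option<Object>}} / {{RDD<Object>}} and must cast element-by-element. A minimal Spark-free sketch of the caller's view ({{java.util.Optional}} stands in for {{scala.Option}}; all names here are hypothetical):_

```java
import java.util.Optional;

public class ThresholdDemo {
    // What a Java caller effectively sees today: the element type erased to Object.
    @SuppressWarnings("unchecked")
    static Optional<Object> getThresholdErased() {
        return (Optional<Object>) (Optional<?>) Optional.of(0.5);
    }

    public static void main(String[] args) {
        Optional<Object> t = getThresholdErased();
        // The caller has to know the payload is really a boxed Double and cast:
        double threshold = (Double) t.get();
        System.out.println(threshold);
    }
}
```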

h3. Clustering

DistributedLDAModel
* RDD<scala.Tuple2<Object,Vector>>      topicDistributions() *--> TARGET 1.4: create Java version*

GaussianMixtureModel
* RDD<Object>   predict(RDD<Vector> points) *--> TARGET 1.4: create Java versions*

StreamingKMeans *--> TARGET 1.4: create Java versions, following logreg example*
* DStream<Object>       predictOn(DStream<Vector> data)
* <K> DStream<scala.Tuple2<K,Object>>   predictOnValues(DStream<scala.Tuple2<K,Vector>> data, scala.reflect.ClassTag<K> evidence$1)
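_The "create Java versions" fix usually adds a boxed companion next to the Scala method. A hypothetical Spark-free sketch of that wrapper pattern (a plain {{List}} plays the role of {{DStream}}/{{RDD}}; the method names are illustrative only):_

```java
import java.util.List;
import java.util.stream.Collectors;

public class JavaFriendlyDemo {
    // Stand-in for the existing method whose element type Java sees as Object.
    static List<Object> predictOnErased(List<double[]> data) {
        return data.stream()
                   .map(v -> (Object) Double.valueOf(v[0]))  // toy "prediction"
                   .collect(Collectors.toList());
    }

    // Hypothetical Java-friendly companion: same computation, but typed as
    // java.lang.Double so Java callers need no casts.
    static List<Double> predictOn(List<double[]> data) {
        return predictOnErased(data).stream()
                                    .map(o -> (Double) o)
                                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> preds = predictOn(List.of(new double[]{1.0, 2.0}, new double[]{3.0}));
        System.out.println(preds);
    }
}
```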

h3. Evaluation

BinaryClassificationMetrics *--> Old API.  Leave for now.  Fix via Pipelines API*
* LOTS (everything taking/returning an RDD)

h3. Feature

Word2VecModel
* scala.Tuple2<String,Object>[] findSynonyms  *--> Old API.  Fix with Java version?*
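_One possible Java-friendly shape for {{findSynonyms}} avoids Scala tuples entirely by returning a small result type with a primitive similarity. A hypothetical Spark-free sketch (the model itself is a toy with canned scores; nothing here is the actual Word2VecModel API):_

```java
import java.util.Arrays;

public class SynonymsDemo {
    // Hypothetical result type replacing scala.Tuple2<String,Object>[]:
    // plain fields, no Scala tuples, similarity as a primitive double.
    static final class Synonym {
        final String word;
        final double similarity;
        Synonym(String word, double similarity) { this.word = word; this.similarity = similarity; }
    }

    // Toy stand-in model: similarities are canned scores per candidate word.
    static Synonym[] findSynonyms(String[] vocab, double[] scores, int num) {
        Synonym[] all = new Synonym[vocab.length];
        for (int i = 0; i < vocab.length; i++) all[i] = new Synonym(vocab[i], scores[i]);
        Arrays.sort(all, (a, b) -> Double.compare(b.similarity, a.similarity));  // best first
        return Arrays.copyOf(all, Math.min(num, all.length));
    }

    public static void main(String[] args) {
        Synonym[] top = findSynonyms(new String[]{"queen", "car"}, new double[]{0.9, 0.1}, 1);
        System.out.println(top[0].word + " " + top[0].similarity);
    }
}
```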

h3. Linalg

*All Target 1.5*

SparseMatrix
* static SparseMatrix   fromCOO(int numRows, int numCols, scala.collection.Iterable<scala.Tuple3<Object,Object,Object>> entries)  *--> Fix with Java version?*

Vectors
* static Vector sparse(int size, scala.collection.Seq<scala.Tuple2<Object,Object>> elements)  *--> Java version?*

BlockMatrix  *--> Fix with Java version*
* RDD<scala.Tuple2<scala.Tuple2<Object,Object>,Matrix>> blocks()
** _This issue appears in the constructors too._
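_For factory methods like {{fromCOO}}, one possible Java-friendly shape takes parallel primitive arrays instead of {{Iterable<Tuple3<Object,Object,Object>>}}, so no boxing or Scala types leak into the Java signature. A hypothetical sketch (a dense 2-D array stands in for SparseMatrix; this is not the real API):_

```java
public class FromCooDemo {
    // Hypothetical Java-friendly signature: parallel primitive arrays for the
    // COO triples (row index, column index, value).
    static double[][] fromCOO(int numRows, int numCols, int[] rows, int[] cols, double[] values) {
        double[][] m = new double[numRows][numCols];
        for (int i = 0; i < values.length; i++) {
            m[rows[i]][cols[i]] += values[i];  // accumulate duplicate entries
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] m = fromCOO(2, 2, new int[]{0, 1}, new int[]{1, 1}, new double[]{3.0, 4.0});
        System.out.println(m[0][1] + " " + m[1][1]);
    }
}
```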

h3. Optimization

_(lower priority b/c DeveloperApi which needs to be updated anyways)_

Optimizer
* Vector        optimize(RDD<scala.Tuple2<Object,Vector>> data, Vector initialWeights)
* _Same issue appears elsewhere, wherever Double is used in a tuple._

Gradient
* scala.Tuple2<Vector,Object>   compute(Vector data, double label, Vector weights)

h3. Recommendation

MatrixFactorizationModel *--> Target 1.5: Decide how to fix*
* _constructor_: MatrixFactorizationModel(int rank, RDD<scala.Tuple2<Object,double[]>> userFeatures, RDD<scala.Tuple2<Object,double[]>> productFeatures)
* RDD<scala.Tuple2<Object,double[]>>    productFeatures()
* RDD<scala.Tuple2<Object,Rating[]>>    recommendProductsForUsers(int num)
* RDD<scala.Tuple2<Object,Rating[]>>    recommendUsersForProducts(int num)
* RDD<scala.Tuple2<Object,double[]>>    userFeatures()

h3. Stats

Statistics *--> TARGET 1.4: Java versions*
* static double corr(RDD<Object> x, RDD<Object> y)
* static double corr(RDD<Object> x, RDD<Object> y, String method)
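_The {{corr}} overloads have the same erasure problem: a Java caller cannot pass a {{JavaRDD<Double>}} where {{RDD<Object>}} is expected. A self-contained Pearson-correlation stand-in over plain lists, just to make the intended boxed signature concrete (assumption: a Java version would accept {{java.lang.Double}} elements):_

```java
import java.util.List;

public class CorrDemo {
    // Intended Java-friendly shape: boxed Double elements, no RDD<Object>.
    static double corr(List<Double> x, List<Double> y) {
        int n = x.size();
        double mx = x.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double my = y.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double sxy = 0, sxx = 0, syy = 0;  // centered cross- and self-products
        for (int i = 0; i < n; i++) {
            double dx = x.get(i) - mx, dy = y.get(i) - my;
            sxy += dx * dy; sxx += dx * dx; syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Perfectly linearly related series, so the correlation is 1.0.
        System.out.println(corr(List.of(1.0, 2.0, 3.0), List.of(2.0, 4.0, 6.0)));
    }
}
```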

h3. Trees

DecisionTreeModel
* JavaRDD<Object>       predict(JavaRDD<Vector> features)  *--> Old API.*
** _This is because we use Double instead of java.lang.Double (unlike in, e.g., TreeEnsembleModel)._
** _Users can use spark.ml API anyways._

Split _(low priority; use spark.ml API instead)_
* scala.collection.immutable.List<Object>       categories()

h3. False positives

DataValidators  *--> OK for now*
* static scala.Function1<RDD<LabeledPoint>,Object>      binaryLabelValidator()
* static scala.Function1<RDD<LabeledPoint>,Object>      multiLabelValidator(int k)



> Java compatibility check for MLlib 1.4
> --------------------------------------
>
>                 Key: SPARK-7529
>                 URL: https://issues.apache.org/jira/browse/SPARK-7529
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>    Affects Versions: 1.4.0
>            Reporter: Xiangrui Meng
>            Assignee: Joseph K. Bradley
>
> Check Java compatibility for MLlib 1.4. We should create separate JIRAs for 
> each possible issue.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)



