Joseph K. Bradley created SPARK-4285:
----------------------------------------

             Summary: Transpose RDD[Vector] to column store for ML
                 Key: SPARK-4285
                 URL: https://issues.apache.org/jira/browse/SPARK-4285
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
            Reporter: Joseph K. Bradley
            Priority: Minor


For certain ML algorithms, a column store is more efficient than the row store 
that MLlib currently uses everywhere.  For example, deep decision trees can be 
faster to train when the data is partitioned by feature.

Proposal: Provide a method with the following API (probably in util/):
```
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
```
The input Vectors will be data rows/instances, and the output Vectors will be 
columns/features paired with column/feature indices.
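
To make the proposed signature concrete, here is a minimal sketch of one way it could be implemented on top of the existing RDD API. This is an illustration only, not an existing MLlib method: the enclosing object name `ColumnStoreSketch` is hypothetical, and the sketch assumes that all rows have the same length, that the number of rows fits in an Int, and that a single column fits in the memory of one executor. It also assumes a Spark version where pair-RDD operations such as `groupByKey` are available without an extra implicit import.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object ColumnStoreSketch {
  // Illustrative sketch only; not an existing MLlib method.
  // Assumes equal-length rows and a row count that fits in an Int.
  def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)] = {
    // Pair each row with its position so column entries can be ordered later.
    val indexedRows = data.zipWithIndex()
    val numRows = indexedRows.count().toInt

    indexedRows
      .flatMap { case (row, rowIdx) =>
        // Emit (columnIndex, (rowIndex, value)) for every entry of the row.
        // Note: toArray densifies sparse rows.
        row.toArray.iterator.zipWithIndex.map { case (value, colIdx) =>
          (colIdx, (rowIdx.toInt, value))
        }
      }
      .groupByKey() // shuffle: gather all entries of one column together
      .map { case (colIdx, entries) =>
        // Place each value at its original row position.
        val column = new Array[Double](numRows)
        entries.foreach { case (rowIdx, value) => column(rowIdx) = value }
        (colIdx, Vectors.dense(column))
      }
  }
}
```

The sketch traverses `data` more than once (for indexing, counting, and the transpose itself), so caching the input beforehand would likely help. It also emits one record per matrix entry before the shuffle, so a real implementation would probably want to transpose blocks of rows within each partition first.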

**Question**: Is it important to maintain matrix structure?  That is, should 
output Vectors in the same partition be adjacent columns in the matrix?
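
If the answer is yes, one possible way to keep adjacent columns together is a custom `Partitioner` keyed on the column index. The sketch below is only an illustration of that option: the class name `ContiguousColumnPartitioner` and its parameters are hypothetical, and it assumes the total number of columns is known up front.

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner: maps contiguous ranges of column indices
// to the same partition, so adjacent columns stay together.
class ContiguousColumnPartitioner(numCols: Int, parts: Int) extends Partitioner {
  require(numCols > 0, "numCols must be positive")
  require(parts > 0, "parts must be positive")

  // Columns per partition, rounded up so every column gets a partition.
  private val colsPerPart: Int = (numCols + parts - 1) / parts

  override def numPartitions: Int = parts

  override def getPartition(key: Any): Int = key match {
    case col: Int =>
      require(col >= 0 && col < numCols, s"column index out of range: $col")
      col / colsPerPart
  }
}
```

With an `RDD[(Int, Vector)]` of columns, calling `partitionBy(new ContiguousColumnPartitioner(numCols, parts))` would place columns `0` to `colsPerPart - 1` in partition 0, the next range in partition 1, and so on. Some trailing partitions may be empty when `numCols` is not a multiple of `parts`.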


