Joseph K. Bradley created SPARK-4285:
----------------------------------------
Summary: Transpose RDD[Vector] to column store for ML
Key: SPARK-4285
URL: https://issues.apache.org/jira/browse/SPARK-4285
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor
For certain ML algorithms, a column store is more efficient than the row store
that MLlib currently uses everywhere. For example, deep decision trees can be
faster to train when the data is partitioned by feature rather than by instance.
Proposal: Provide a method with the following API (probably in util/):
```
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
```
The input Vectors will be data rows/instances, and the output Vectors will be
columns/features paired with column/feature indices.
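To make the proposed signature concrete, here is a minimal sketch of one way it could be implemented. It is an illustration only, not an existing MLlib method, and it rests on several assumptions: rows are treated as dense, every row has the same length, and each column is small enough to be assembled in memory on a single executor. It also shuffles every matrix entry, so it is unlikely to be the most efficient approach.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical sketch of the proposed API; not part of MLlib.
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)] = {
  data
    // Tag each row with its position so row order can be restored later.
    .zipWithIndex()
    // Emit one record per matrix entry, keyed by column index.
    .flatMap { case (row, rowIndex) =>
      row.toArray.iterator.zipWithIndex.map { case (value, colIndex) =>
        (colIndex, (rowIndex, value))
      }
    }
    // Gather all entries of a column (assumes a column fits in memory).
    .groupByKey()
    // Sort by row index and pack the values into a dense column Vector.
    .mapValues { entries =>
      Vectors.dense(entries.toArray.sortBy(_._1).map(_._2))
    }
}
```

With this sketch, an input of two rows `[1, 2, 3]` and `[4, 5, 6]` would yield three columns keyed 0, 1, and 2, each holding two values in row order.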
**Question**: Is it important to maintain matrix structure? That is, should
output Vectors in the same partition be adjacent columns in the matrix?
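If adjacency does matter, one possible way to get it is a custom partitioner that assigns contiguous blocks of column indices to each partition. The sketch below is hypothetical: the class name `ColumnBlockPartitioner` and its parameters are invented for illustration, and it assumes the total number of columns is known up front.

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner: maps contiguous ranges of column indices
// to the same partition, preserving matrix adjacency within a partition.
class ColumnBlockPartitioner(numCols: Int, override val numPartitions: Int)
    extends Partitioner {
  require(numCols > 0 && numPartitions > 0, "numCols and numPartitions must be positive")

  // Columns per partition, rounded up so that every column is covered.
  private val blockSize: Int = math.ceil(numCols.toDouble / numPartitions).toInt

  // Keys are expected to be Int column indices; any other key type
  // fails with a MatchError.
  override def getPartition(key: Any): Int = key match {
    case colIndex: Int => math.min(colIndex / blockSize, numPartitions - 1)
  }
}
```

An `RDD[(Int, Vector)]` of indexed columns could then be rearranged with `columns.partitionBy(new ColumnBlockPartitioner(numCols, numPartitions))`, at the cost of an extra shuffle.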