[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632061#comment-14632061
 ] 

Nick Buroojy commented on SPARK-8418:
-------------------------------------

I like this idea a lot, and I think it would solve one of our main performance 
issues with the ML API.

Our data set has hundreds of string features that we need to convert into 
binary vectors. We have found the latency overhead of processing the features 
one at a time with a StringVectorizer (SPARK-7290) to be unbearable. We wrote a 
custom Estimator that vectorizes all string columns in only a couple of passes 
over the data set, and we found significant performance gains.
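
A rough sketch of the "couple of passes" idea follows, in plain Python rather 
than Spark. It is not the custom Estimator mentioned above; the names 
fit_vocabularies and transform, the "_vec" output suffix, and the sample rows 
are all hypothetical and chosen only for illustration. One pass collects the 
distinct values of every column at once, and a second pass encodes them.

```python
from collections import defaultdict


def fit_vocabularies(rows, columns):
    """First pass: collect the distinct values of every column at once."""
    seen = defaultdict(set)
    for row in rows:
        for col in columns:
            seen[col].add(row[col])
    # Sort so that the value -> index mapping is deterministic.
    return {col: {v: i for i, v in enumerate(sorted(seen[col]))} for col in columns}


def transform(rows, vocab):
    """Second pass: add a one-hot (binary) vector for each fitted column."""
    out = []
    for row in rows:
        encoded = dict(row)
        for col, index in vocab.items():
            vec = [0] * len(index)
            vec[index[row[col]]] = 1  # raises KeyError for values unseen in fit
            encoded[col + "_vec"] = vec
        out.append(encoded)
    return out


rows = [
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "M"},
    {"color": "red", "size": "M"},
]
vocab = fit_vocabularies(rows, ["color", "size"])
encoded = transform(rows, vocab)
```

The number of passes stays at two no matter how many columns are listed, which 
is the property that matters for wide data sets.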

I suspect that we aren't the only users with many columns, so we would love to 
fix this issue upstream with some sort of multi-column interface to 
transformers and estimators.

I suppose we could make do with the Vector or Array interface using the 
VectorAssembler as described in this ticket; however, I think the cleanest 
interface for us would be a Map from source column to dest column.
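
To make the Map idea concrete, here is a minimal sketch in plain Python. The 
function transform_columns and the column names are hypothetical; nothing here 
is an existing Spark API. It only shows the shape of an interface keyed by 
source column with the destination column as the value.

```python
def transform_columns(rows, col_map, fn):
    """Apply fn to each source column and write the result to its dest column."""
    return [
        {**row, **{dest: fn(row[src]) for src, dest in col_map.items()}}
        for row in rows
    ]


rows = [{"city": "Austin", "state": "TX"}]
col_map = {"city": "city_lower", "state": "state_lower"}  # source -> dest
result = transform_columns(rows, col_map, str.lower)
```

With a map, each output name is stated next to its input, so there is no need 
to keep two parallel lists of column names in sync.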

As far as sharing code, there are at least two strategies:
1) Use the single-value implementation as it is today, and add a multi-value 
view on top of it. For example, StringVectorizer.setInputCols(Array(A, B)) 
would return a pipeline of [StringVectorizer.setInputCol(A), 
StringVectorizer.setInputCol(B)].
2) Reimplement each transformer to support a multi-value implementation and 
make the single-value interface a trivial invocation of the multi-value code. 
For example, StringVectorizer.setInputCol(A) would invoke 
StringVectorizer.setInputCols(Array(A)).

The obvious downside of 1 is that it wouldn't address the performance issues we 
ran into with hundreds of columns. The upsides are minimal implementation 
effort and simpler code to maintain.

The main downside of 2 is more upfront effort to implement multi-value 
transformations, but the upside is reasonable performance with "wide" data sets.
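
The trade-off between 1 and 2 can be sketched in plain Python by counting full 
passes over the data. Everything below is hypothetical (CountingData, 
fit_single, fit_multi, and so on); it is a toy model of the two code-sharing 
strategies, not Spark code.

```python
class CountingData:
    """Wraps a list of rows and counts how many full passes are made."""

    def __init__(self, rows):
        self.rows = rows
        self.passes = 0

    def scan(self):
        self.passes += 1
        return iter(self.rows)


def fit_single(data, col):
    """Single-column fit: one full pass for one column."""
    return sorted({row[col] for row in data.scan()})


def fit_multi_via_single(data, cols):
    """Strategy 1: a multi-column view layered on the single-column code."""
    return {col: fit_single(data, col) for col in cols}


def fit_multi(data, cols):
    """Strategy 2: a native multi-column fit, one pass for all columns."""
    seen = {col: set() for col in cols}
    for row in data.scan():
        for col in cols:
            seen[col].add(row[col])
    return {col: sorted(vals) for col, vals in seen.items()}


def fit_single_via_multi(data, col):
    """Strategy 2: the single-column call is a trivial wrapper."""
    return fit_multi(data, [col])[col]


rows = [{"a": "x", "b": "p", "c": "m"}, {"a": "y", "b": "p", "c": "n"}]
cols = ["a", "b", "c"]

d1 = CountingData(rows)
r1 = fit_multi_via_single(d1, cols)  # one pass per column

d2 = CountingData(rows)
r2 = fit_multi(d2, cols)  # one pass in total
```

Both strategies produce the same result here; they differ only in that the 
number of passes grows with the number of columns under 1 and stays constant 
under 2.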

I don't think 1 and 2 are mutually exclusive. Maybe the multi-value interface 
could be solidified first using approach 1, and then over time the key 
transformers, like StringVectorizer, could be rewritten using approach 2?

You mentioned that this would require a short design doc. Can I help with that?

> Add single- and multi-value support to ML Transformers
> ------------------------------------------------------
>
>                 Key: SPARK-8418
>                 URL: https://issues.apache.org/jira/browse/SPARK-8418
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



