V Luong commented on SPARK-23333:

[~cloud_fan] there are many scenarios in which oldDF has a sort in its 
plan, e.g. when certain feature columns are computed with window functions. 
In general, it would be a pain to always make sure that oldDF doesn't involve 
sorting (e.g. by checkpointing to files) before calling VectorAssembler. In any 
case, establishing VectorAssembler metadata shouldn't strictly require the first row.
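
To illustrate the point, here is a toy model in plain Python. It is not Spark 
code: CountingDataset, first_sorted and any_row are hypothetical names made up 
for this sketch, and it assumes every row carries a feature vector of the same 
length. It only shows that the length taken from an arbitrary row matches the 
one taken from the first row in sort order, while reading far fewer rows.

```python
def vector_length(row):
    """Return the length of the row's feature vector (the metadata we want)."""
    return len(row["features"])


class CountingDataset:
    """Hypothetical stand-in for a DataFrame that counts how many rows it reads."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.rows_read = 0

    def first_sorted(self, key):
        # Finding the row that sorts first means examining every row.
        self.rows_read += len(self._rows)
        return min(self._rows, key=key)

    def any_row(self):
        # Any row is acceptable, so a single read is enough.
        self.rows_read += 1
        return self._rows[0]


# Assumption: all rows share the same vector length (3 here).
rows = [{"id": i, "features": [float(i), 0.0, 1.0]} for i in range(10_000)]

sorted_ds = CountingDataset(rows)
n_first = vector_length(sorted_ds.first_sorted(key=lambda r: r["id"]))

cheap_ds = CountingDataset(rows)
n_any = vector_length(cheap_ds.any_row())

print(n_first, sorted_ds.rows_read)
print(n_any, cheap_ds.rows_read)
```

Both calls yield the same vector length; only the number of rows touched 
differs. How much a real oldDF.first() costs depends on the actual query plan, 
which this sketch does not attempt to model.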

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> ------------------------------------------------------------------------------------------
>                 Key: SPARK-23333
>                 URL: https://issues.apache.org/jira/browse/SPARK-23333
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib, SQL
>    Affects Versions: 2.2.1
>            Reporter: V Luong
>            Priority: Minor
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88].
> When oldDF is sorted, this call to oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> is just as good as taking oldDF.first(). Is there a way to speed this up 
> substantially by grabbing an arbitrary row instead of relying on 
> oldDF.first()?
