Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    The stack of projections will be collapsed during optimization, so it doesn't 
affect query execution. However, every `withColumn` call creates a new 
`DataFrame` with a projection on top of the previous logical plan. That is 
costly: each call creates a new query execution, analyzes the logical plan, 
creates an encoder, and so on. The improvement comes from paying that cost once 
with `withColumns`, instead of once per `withColumn` call.
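    
    For illustration, here is a minimal sketch of the difference (assuming the 
batched `withColumns(colNames: Seq[String], cols: Seq[Column])` method this PR 
adds were callable from user code):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    
    val spark = SparkSession.builder().appName("withColumns-sketch").getOrCreate()
    val df = spark.range(10).toDF("id")
    val names = (1 to 100).map(i => s"c$i")
    val exprs = names.map(_ => col("id") + 1)
    
    // Chained form: each call builds a new DataFrame, so the plan is
    // re-analyzed and a new encoder is created 100 times.
    val chained = names.zip(exprs).foldLeft(df) {
      case (d, (n, e)) => d.withColumn(n, e)
    }
    
    // Batched form: one new DataFrame, one analysis pass.
    val batched = df.withColumns(names, exprs)
    ```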
    
    It can also benefit other transformers that work on multiple columns. I even 
have an idea to revamp the `Transformer` interface, because the transformation 
in a `Transformer` typically ends with a `withColumn` call that adds or 
replaces a column; transformers are really transforming columns of the dataset. 
But the performance difference only becomes obvious when the number of 
transformation stages is large enough, as in the example with many 
`Bucketizer`s, so it may not be worth doing. Just a thought.
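    
    To illustrate what I mean, here is a hypothetical minimal `Transformer` 
(`AddOne` is made up, not real Spark ML code): everything interesting in its 
`transform` boils down to a single `withColumn` call that adds the output 
column.
    
    ```scala
    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
    
    // Hypothetical minimal transformer: like Bucketizer, its transform()
    // ends with a single withColumn call that adds the output column.
    class AddOne(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("addOne"))
    
      override def transform(dataset: Dataset[_]): DataFrame = {
        val f = udf { v: Double => v + 1.0 }         // stand-in per-element logic
        dataset.withColumn("out", f(dataset("in")))  // the withColumn call
      }
    
      override def transformSchema(schema: StructType): StructType =
        StructType(schema.fields :+ StructField("out", DoubleType, nullable = false))
    
      override def copy(extra: ParamMap): AddOne = defaultCopy(extra)
    }
    ```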
    


