rednaxelafx commented on issue #24515: [SPARK-14083][WIP] Basic bytecode 
analyzer to speed up Datasets
URL: https://github.com/apache/spark/pull/24515#issuecomment-489241706
 
 
   Thanks for your work, @aokolnychyi and @dbtsai !
   I'm super excited about this PR as a concrete place to start a discussion on 
improving the performance of the existing typed Dataset operations.
   
   I've worked on a continuation of @JoshRosen 's 
[prototype](https://github.com/apache/spark/compare/master...JoshRosen:expression-analysis?diff=unified&name=expression-analysis)
 about two years ago, so I have some first-hand experience with both the 
implementation details and the applicability of this direction.
   I'll be sharing my thoughts on this topic in the coming couple of days. It 
might end up being a long write-up, but please stay tuned!
   
   In the meantime, though, I'd really like to call on the community to share 
their use cases for the existing typed Dataset operations, so that we'll 
be able to better evaluate how much benefit this project will bring to 
real-world queries.
   
   In a lot of cases, simply moving uses of the typed Dataset operations to the 
equivalent untyped DataFrame operations can substantially speed up queries; 
third-party solutions like [Quill](https://github.com/getquill/quill) can 
provide a typed API for Scala but directly generate untyped operations that are 
fast to begin with (in Quill's case, the generated code is SQL). So people 
who are not content with the current performance of the typed Dataset 
operations already have two directions they can pursue:
   1. Just use the untyped DataFrame API. **Pros**: fast; **Cons**: not statically 
typed in the host language (Scala)
   2. Use third-party bindings like Quill. **Pros**: fast and typed; **Cons**: 
don't cover all the use cases of the typed Dataset API.
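
   To make the first direction concrete, here is a minimal sketch of the same 
query written both ways. The `Person` case class, the `TypedVsUntyped` object 
and its method names are hypothetical and exist only for illustration; the 
Spark calls (`filter`, `map`, `select`, `as`) are the regular Dataset API.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object TypedVsUntyped {
  // Typed version: the lambdas are opaque to Catalyst, so rows are
  // deserialized into Person objects before each closure runs.
  def typedAdultNames(spark: SparkSession, people: Dataset[Person]): Dataset[String] = {
    import spark.implicits._
    people
      .filter((p: Person) => p.age >= 18)
      .map((p: Person) => p.name)
  }

  // Untyped version: the same logic as Column expressions, which Catalyst
  // can analyze and optimize. Column names are only checked at analysis
  // time, not by the Scala compiler.
  def untypedAdultNames(spark: SparkSession, people: Dataset[Person]): Dataset[String] = {
    import spark.implicits._
    people
      .filter(col("age") >= 18)
      .select(col("name"))
      .as[String]
  }
}
```

   Both methods return the same rows; comparing their plans with `explain()` 
is one way to see where the typed version adds deserialization steps.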
   
   Or, if a query uses bulk operations like `mapPartitions`, then the overhead 
of the typed operations *could* be negligible.
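
   As a rough illustration of such a bulk operation, the sketch below calls 
the user closure once per partition rather than once per row, so any setup 
inside it is paid once per partition. The `BulkTokenize` object and `tokenize` 
method are hypothetical names. Note that rows are still deserialized into JVM 
objects before they reach the iterator, so whether the remaining overhead is 
negligible depends on the workload.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object BulkTokenize {
  // Splits each line into lowercase words. The closure passed to
  // mapPartitions is invoked once per partition, not once per row.
  def tokenize(spark: SparkSession, lines: Dataset[String]): Dataset[String] = {
    import spark.implicits._
    lines.mapPartitions { (rows: Iterator[String]) =>
      // Per-partition setup: compiled once and reused for every row.
      val splitter = java.util.regex.Pattern.compile("\\s+")
      rows
        .flatMap(line => splitter.split(line.trim).iterator)
        .filter(_.nonEmpty)
        .map(_.toLowerCase)
    }
  }
}
```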
   
   There are a few cases where users may be forced to use the typed Dataset 
operations, e.g. when they need to use Structured Streaming APIs like 
`mapGroupsWithState` and `flatMapGroupsWithState`. In such scenarios, it is 
indeed very important to be able to speed up typed Dataset operations because 
there may not be a good alternative.
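
   For readers who haven't used these APIs, here is a minimal sketch of the 
kind of code involved. The `Event` and `RunningTotal` case classes and the 
`StatefulTotals` object are hypothetical; `groupByKey`, `mapGroupsWithState`, 
`GroupState` and `GroupStateTimeout` are the existing Spark APIs. The update 
logic is an arbitrary Scala function, which is why there is no untyped 
equivalent to fall back on.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical input and output types, used only for this illustration.
case class Event(user: String, value: Long)
case class RunningTotal(user: String, total: Long)

object StatefulTotals {
  // Called once per key per trigger with that key's new events.
  // The state holds the total accumulated so far for the key.
  def updateTotal(
      user: String,
      events: Iterator[Event],
      state: GroupState[Long]): RunningTotal = {
    val total = state.getOption.getOrElse(0L) + events.map(_.value).sum
    state.update(total)
    RunningTotal(user, total)
  }

  def runningTotals(spark: SparkSession, events: Dataset[Event]): Dataset[RunningTotal] = {
    import spark.implicits._
    events
      .groupByKey((e: Event) => e.user)
      .mapGroupsWithState[Long, RunningTotal](GroupStateTimeout.NoTimeout())(updateTotal _)
  }
}
```

   The same method also runs on a batch Dataset, where the state simply 
starts out empty for every key, which makes it easy to try out locally.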
   
   So, questions for everybody:
   - How much are you using typed Dataset operations?
   - Which operations?
   - What kind of code are you putting into the lambdas for the typed Dataset 
operations?
   
   Thanks!
