[GitHub] [spark] rednaxelafx commented on issue #24515: [SPARK-14083][WIP] Basic bytecode analyzer to speed up Datasets

GitBox Fri, 03 May 2019 14:13:36 -0700

rednaxelafx commented on issue #24515: [SPARK-14083][WIP] Basic bytecode
analyzer to speed up Datasets
URL: https://github.com/apache/spark/pull/24515#issuecomment-489241706

Thanks for your work, @aokolnychyi and @dbtsai !
I'm super excited about this PR as a concrete place to start a discussion on
improving the performance of the existing typed Dataset operations.

I worked on a continuation of @JoshRosen 's
[prototype](https://github.com/apache/spark/compare/master...JoshRosen:expression-analysis?diff=unified&name=expression-analysis)
about two years ago, so I have some first-hand experience with both the
implementation details and the applicability of this direction.
I'll be sharing my thoughts on this topic in the coming couple of days. It
might end up being a long write-up, but please stay tuned!

In the meantime, though, I'd really like to call on the community to share
their use cases for the existing typed Dataset operations, so that we can
better evaluate how much benefit this project will bring to real-world
queries.

In a lot of cases, simply moving uses of the typed Dataset operations to the
equivalent untyped DataFrame operations can substantially speed up queries;
third-party solutions like [Quill](https://github.com/getquill/quill) can
provide a typed API for Scala but directly generate untyped operations that are
fast to begin with (in Quill's case, the generated code is SQL). So people who
are not content with the current performance of the typed Dataset operations
already have two directions they can pursue:
1. Just use the untyped DataFrame API. **Pros**: fast; **Cons**: not statically
typed in the host language (Scala).
2. Use third-party bindings like Quill. **Pros**: fast and typed; **Cons**:
they don't cover all the use cases of the typed Dataset API.

Or, if a query uses bulk operations like `mapPartitions`, then the overhead
*could* be negligible.
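
To make the options above concrete, here is a minimal sketch (not taken from the PR) of the same predicate written three ways: as a typed lambda, as an untyped `Column` expression, and as a bulk `mapPartitions` call. The `Person` case class, the `TypedVsUntyped` object, and the sample rows are made up for illustration; only the `Dataset` methods themselves are real Spark API. Comparing the `explain()` output of the three should show how much of each query Catalyst can actually see.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object TypedVsUntyped {
  // Typed operation: the lambda is an opaque JVM function to Catalyst,
  // so each row has to be turned into a Person object before it runs.
  def typedAdults(ds: Dataset[Person]): Dataset[Person] =
    ds.filter(p => p.age >= 21)

  // Untyped operation: a Column expression that Catalyst can analyze,
  // optimize, and code-generate without building Person objects.
  def untypedAdults(ds: Dataset[Person]): Dataset[Person] =
    ds.filter(col("age") >= 21)

  // Bulk typed operation: one function call per partition rather than per row.
  // The encoder is passed explicitly so no implicits are needed here.
  def bulkAdults(ds: Dataset[Person]): Dataset[Person] =
    ds.mapPartitions((it: Iterator[Person]) => it.filter(_.age >= 21))(
      Encoders.product[Person])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-vs-untyped")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

    // Print the physical plans so the three variants can be compared.
    typedAdults(people).explain()
    untypedAdults(people).explain()
    bulkAdults(people).explain()

    spark.stop()
  }
}
```

All three variants return the same rows; the difference is only in how much of the computation is visible to the optimizer, which is the gap a bytecode analyzer would try to close for the first variant.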

There are a few cases where users may be forced to use the typed Dataset
operations, e.g. when they need to use Structured Streaming APIs like
`mapGroupsWithState` and `flatMapGroupsWithState`. In such scenarios, it is
indeed very important to be able to speed up typed Dataset operations because
there may not be a good alternative.

So questions to everybody:
- How much are you using typed Dataset operations?
- Which operations?
- What kind of code are you putting into the lambdas for the typed Dataset
operations?

Thanks!

