GitHub user thunterdb commented on the issue:
https://github.com/apache/spark/pull/17419
I looked a bit deeper into the performance aspect. Here are some quick
insights:
- there was an immediate bottleneck in `VectorUDT`; removing it already
improves performance by 3x
- it is not clear if switching to pure Breeze operations helps given the
overhead for tiny vectors. I will need to do more analysis on larger vectors.
- now, most of the time is roughly split between
`ObjectAggregationIterator.processInputs` (40%), some codegen'ed expression
(20%) and our own `MetricsAggregate.update` (35%)
This benchmark focuses on the overhead of Catalyst. I will run another
benchmark with dense vectors to see how it fares in practice on more
realistic data.
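
To make the tiny-vector concern concrete, here is a rough sketch of the kind
of comparison I have in mind. It is not the PR code and it does not use
Breeze: `TinyVectorOverhead` and all of its methods are hypothetical names,
and the allocating variant is only a stand-in for any operation that builds a
new vector per input row. The timing is crude (one warm-up pass, no JMH), so
treat the numbers as indicative at best.

```scala
object TinyVectorOverhead {
  // In-place update: adds `v` into `acc` without allocating anything.
  def addInPlace(acc: Array[Double], v: Array[Double]): Unit = {
    var i = 0
    while (i < v.length) {
      acc(i) += v(i)
      i += 1
    }
  }

  // Allocating update: returns a fresh array on every call. This is an
  // assumed stand-in for a per-row operation that creates a new vector.
  def addAllocating(acc: Array[Double], v: Array[Double]): Array[Double] =
    acc.zip(v).map { case (a, b) => a + b }

  // Sums all rows element-wise using the in-place update.
  def sumInPlace(rows: Array[Array[Double]], dim: Int): Array[Double] = {
    val acc = new Array[Double](dim)
    rows.foreach(addInPlace(acc, _))
    acc
  }

  // Sums all rows element-wise, allocating a new accumulator per row.
  def sumAllocating(rows: Array[Array[Double]], dim: Int): Array[Double] =
    rows.foldLeft(new Array[Double](dim))(addAllocating)

  // Runs `body` once and returns the elapsed wall-clock time in nanoseconds.
  def timeNanos[A](body: => A): Long = {
    val start = System.nanoTime()
    body
    System.nanoTime() - start
  }

  def main(args: Array[String]): Unit = {
    val dim = 3 // tiny vectors, where per-call overhead dominates
    val rows = Array.fill(200000)(Array.fill(dim)(1.0))

    // One warm-up pass each so the JIT has seen both code paths.
    sumInPlace(rows, dim)
    sumAllocating(rows, dim)

    val inPlaceNs = timeNanos(sumInPlace(rows, dim))
    val allocatingNs = timeNanos(sumAllocating(rows, dim))
    println(s"in-place: ${inPlaceNs / 1e6} ms, allocating: ${allocatingNs / 1e6} ms")
  }
}
```

Raising `dim` in `main` is the quickest way to see whether the gap between
the two variants shrinks as the vectors grow, which is the question the
larger-vector analysis is meant to answer.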