[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

thunterdb Thu, 30 Mar 2017 16:43:27 -0700

Github user thunterdb commented on the issue:

    https://github.com/apache/spark/pull/17419
  
    I looked a bit deeper into the performance aspect. Here are some quick 
insights:
     - there was an immediate bottleneck in `VectorUDT`; fixing it already boosts 
performance by 3x
     - it is not clear if switching to pure Breeze operations helps given the 
overhead for tiny vectors. I will need to do more analysis on larger vectors.
     - now, most of the time is roughly split between 
`ObjectAggregationIterator.processInputs` (40%), some codegen'ed expression 
(20%) and our own `MetricsAggregate.update` (35%)
    
    That benchmark focuses on the overhead of Catalyst. I will run another 
benchmark with dense vectors to see how it fares in practice on more 
realistic data.
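
    To make the per-row cost discussed above concrete, here is a minimal sketch of the kind of work an `update` step in a streaming summarizer performs. It is not the code from this PR: the class `RunningSummary`, its fields, and the choice of Welford's method are all assumptions made for illustration, and it is written in plain Python rather than against Spark or Breeze. Each call pays a fixed cost plus a loop over the vector, which is why fixed overhead dominates for tiny vectors.

```python
from typing import List, Sequence


class RunningSummary:
    """Hypothetical per-column running mean/variance (Welford's method).

    Illustrative only; this is not Spark's MetricsAggregate.
    """

    def __init__(self, size: int) -> None:
        self.count = 0
        self.mean = [0.0] * size
        self.m2 = [0.0] * size  # running sum of squared deviations from the mean

    def update(self, vector: Sequence[float]) -> "RunningSummary":
        # Called once per input row: a fixed per-call cost plus a loop
        # whose length is the number of columns in the vector.
        self.count += 1
        for i, x in enumerate(vector):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])
        return self

    def variance(self) -> List[float]:
        # Unbiased sample variance; zero when fewer than two rows were seen.
        if self.count < 2:
            return [0.0] * len(self.mean)
        return [m / (self.count - 1) for m in self.m2]
```

    Timing `update` on vectors of length 1 versus length 1000 would give a rough idea of how the fixed cost compares with the per-element cost, though the numbers would not transfer directly to the JVM.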


