Github user clockfly commented on the issue:
https://github.com/apache/spark/pull/14446
@davies
This PR was created after analyzing the performance impact of #12352, which
added the row-level metrics and caused a 15% performance regression. I can
verify that regression consistently by comparing the performance of d4b94ea
and 6f88006.
The problem is that I cannot reproduce the same regression consistently on
trunk: the improvement after the fix varies a lot (sometimes 5%, sometimes
20%, sometimes not noticeable). What I observed is that when the same
benchmark code is run 100 times in the same spark-shell session, the per-run
time does not converge.
For example, if we run the code below 100 times:
```
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
```
I observed:
1. The first run may take > 9000 ms.
2. The next few runs are much faster, around 4700 ms each.
3. After that, performance suddenly degrades again, to around 8500 ms per run.
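For reference, this is a minimal sketch of the kind of timing loop I used (my
own sketch, not code from the PR). It assumes a spark-shell session where
`spark` and the `$` column syntax are already in scope, and reuses the same
path and filter as above:
```
// Run the same query 100 times in one spark-shell session and record each run's time.
val timesMs = (1 to 100).map { i =>
  val start = System.nanoTime()
  spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
  val elapsed = (System.nanoTime() - start) / 1000000
  println(s"run $i: $elapsed ms")
  elapsed
}
// Summarize the spread; the per-run times do not converge to a stable value.
println(s"min=${timesMs.min} ms, max=${timesMs.max} ms")
```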
I suspect this has something to do with Java JIT and our codegen logic
(because of codegen, we generate a new class for each run in spark-shell).
As I cannot verify this improvement consistently, I am going to close this
PR.