Github user clockfly commented on the issue:
https://github.com/apache/spark/pull/14446
@davies
This PR was created after analyzing the performance impact of #12352, which
added the row-level metrics and caused a 15% performance regression. I can
verify that regression consistently by comparing the performance of d4b94ea
and 6f88006.
The problem is that I cannot reproduce the same regression consistently on
trunk: the improvement after the fix varies a lot (sometimes 5%, sometimes
20%, sometimes not noticeable). What I observed is that when the same
benchmark code is run 100 times in the same spark-shell session, the per-run
time does not converge.
For example, if we run the code below 100 times:
```
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
```
I observed:
1. The first run may take > 9000 ms.
2. The next few runs are much faster, around 4700 ms each.
3. After that, performance suddenly degrades again, to around 8500 ms per run.
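For reference, this is a minimal sketch of the kind of timing loop I used (my
own sketch, not code from the PR). It assumes a spark-shell session where
`spark` and the `$` column syntax are already in scope, and reuses the same
path and filter as above:
```
// Run the same query 100 times in one spark-shell session and record each run's time.
val timesMs = (1 to 100).map { i =>
  val start = System.nanoTime()
  spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
  val elapsed = (System.nanoTime() - start) / 1000000
  println(s"run $i: $elapsed ms")
  elapsed
}
// Summarize the spread; the per-run times do not converge to a stable value.
println(s"min=${timesMs.min} ms, max=${timesMs.max} ms")
```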
I suspect this has something to do with Java JIT and our codegen logic
(because of codegen, we generate a new class for each run in spark-shell).
As I cannot verify this improvement consistently, I am going to close this
PR.