[
https://issues.apache.org/jira/browse/SPARK-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406306#comment-15406306
]
Sean Zhong edited comment on SPARK-16841 at 8/3/16 5:54 PM:
------------------------------------------------------------
This jira was created after analyzing the performance impact of
https://github.com/apache/spark/pull/12352, which added row-level metrics
and caused a 15% performance regression. I can reproduce that regression
consistently by comparing runs before and after
https://github.com/apache/spark/pull/12352.
The problem is that I cannot reproduce the same regression consistently on
Spark trunk: the performance improvement after the fix varies a lot on trunk
(sometimes 5%, sometimes 20%, sometimes not noticeable). What I observed is
that when running the same benchmark code 100 times in the same spark-shell,
the per-run time does not converge, so I cannot get an exact performance
number.
For example, if we run the code below 100 times,
{code}
spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()
{code}
I observed:
1. The first run may take > 9000 ms.
2. The next few runs are much faster, around 4700 ms each.
3. After that, performance suddenly degrades again, to around 8500 ms per
run.
I suspect this has something to do with the Java JIT and our codegen logic
(because of codegen, we create a new generated class for each run in
spark-shell, which may pollute the JIT code cache).
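To make the non-converging timings above concrete, here is a minimal sketch of the kind of timing loop one could paste into spark-shell to record per-run wall-clock times. The {{timeIt}} helper and the iteration count are illustrative, not the actual benchmark harness used; the Spark query would replace the placeholder workload.

{code}
// Run a block `iters` times and return each iteration's elapsed time in ms.
// Plotting or printing these makes JIT warm-up and code-cache effects visible:
// the first run is slow, the next few are fast, later runs may degrade again.
def timeIt[A](iters: Int)(work: => A): Seq[Long] =
  (1 to iters).map { _ =>
    val start = System.nanoTime()
    work                                   // e.g. the parquet filter + collect
    (System.nanoTime() - start) / 1000000L // elapsed milliseconds
  }

// Placeholder workload; in spark-shell this would be:
//   timeIt(100) { spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect() }
val times = timeIt(5) { (1 to 100000).foldLeft(0L)(_ + _) }
println(times.mkString("run times (ms): ", ", ", ""))
{code}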
Since I cannot verify this improvement consistently on trunk, I am going to
close this jira.
> Improves the row level metrics performance when reading Parquet table
> ---------------------------------------------------------------------
>
> Key: SPARK-16841
> URL: https://issues.apache.org/jira/browse/SPARK-16841
> Project: Spark
> Issue Type: Improvement
> Reporter: Sean Zhong
>
> When reading a Parquet table, Spark adds row-level metrics such as recordsRead and
> bytesRead
> (https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L93).
> The current implementation is not very efficient: when the Parquet vectorized reader
> is not used, updating these metrics can take up to 20% of the read time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]