[GitHub] [spark] sarutak opened a new pull request #34765: [SPARK-37487][SQL][CORE] Avoid performing CollectMetrics twice if the operation is followed by global sort.

GitBox Tue, 30 Nov 2021 22:27:22 -0800


sarutak opened a new pull request #34765:
URL: https://github.com/apache/spark/pull/34765

### What changes were proposed in this pull request?

This PR fixes an issue where `CollectMetrics` is executed twice when it is
followed by a global sort, as in the following example.
```
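// Assumes an active SparkSession named `spark`, as in spark-shell.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(100)
  .observe(
    name = "my_event",
    min($"id").as("min_val"),
    max($"id").as("max_val"),
    sum($"id"),
    count(when($"id" % 2 === 0, 1)).as("num_even"))
  .sort($"id".desc)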
```

The expected statistics calculated by `CollectMetrics` are `[0,99,4950,50]`,
but the actual result is `[0,99,9900,100]`: the sum and the count are doubled.
The reason is that sampling jobs can run before the global sort, and those
jobs evaluate `CollectMetrics` an extra time.

https://github.com/apache/spark/blob/e7fa28930dce468df02b5915e1792ada758a96e3/core/src/main/scala/org/apache/spark/Partitioner.scala#L171

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L195
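
The doubled values can be reproduced with a small plain-Scala model that uses no Spark APIs. This is only an illustrative sketch: `Metrics`, `collect`, and `merge` are hypothetical names, and the model simply assumes that the same input is aggregated once by a sampling job and once by the sort itself.

```scala
// Plain-Scala model of the double counting; no Spark APIs are used.
// All names here are hypothetical and exist only for illustration.
object DoubleCountSketch {
  final case class Metrics(min: Long, max: Long, sum: Long, numEven: Long) {
    // Combine two partial results: min/max are idempotent, sum/count are additive.
    def merge(other: Metrics): Metrics =
      Metrics(
        math.min(min, other.min),
        math.max(max, other.max),
        sum + other.sum,
        numEven + other.numEven)
  }

  // Aggregate one pass over the input, mirroring the four observed metrics.
  def collect(ids: Seq[Long]): Metrics =
    Metrics(ids.min, ids.max, ids.sum, ids.count(_ % 2 == 0).toLong)

  val ids: Seq[Long] = 0L until 100L

  // One pass: the expected statistics.
  val once: Metrics = collect(ids)

  // Two passes over the same input (assumed: sampling job + sort job).
  val twice: Metrics = once.merge(collect(ids))

  def main(args: Array[String]): Unit = {
    println(once)
    println(twice)
  }
}
```

In this model a second pass leaves `min` and `max` unchanged but doubles `sum` and `numEven`, which matches the pattern reported above.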

The solution this PR proposes is to introduce a property,
`spark.job.isSamplingJob`, which is intended to be read and set only internally.
Spark sets the property before the sampling jobs run and resets it after
the jobs finish.
`CollectMetrics` can then determine whether a task belongs to a sampling job.
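
The set-then-reset pattern described above might look roughly like the following plain-Scala sketch. It is not the code from this PR: only the property name `spark.job.isSamplingJob` comes from the description, while `Session`, `withSamplingFlag`, and `collectMetrics` are hypothetical stand-ins, and a `java.util.Properties` object is assumed in place of Spark's job-local properties.

```scala
import java.util.Properties

// Sketch of the flag pattern only; this is NOT the PR's implementation.
object SamplingFlagSketch {
  // Property name taken from the PR description.
  val SamplingKey = "spark.job.isSamplingJob"

  // Hypothetical stand-in for a context that carries job-local properties.
  final class Session {
    val props = new Properties()
    var observedSum = 0L

    // Set the flag, run the body, then restore the previous state even on failure.
    def withSamplingFlag[T](body: => T): T = {
      val previous = props.getProperty(SamplingKey)
      props.setProperty(SamplingKey, "true")
      try body
      finally {
        if (previous == null) props.remove(SamplingKey)
        else props.setProperty(SamplingKey, previous)
      }
    }

    // Hypothetical metrics collector: skips accumulation inside a sampling job.
    def collectMetrics(ids: Seq[Long]): Unit = {
      val isSampling = props.getProperty(SamplingKey, "false").toBoolean
      if (!isSampling) observedSum += ids.sum
    }
  }

  // Run a sampling pass and a real pass over the same input.
  def demo(): Long = {
    val session = new Session
    val ids: Seq[Long] = 0L until 100L
    session.withSamplingFlag(session.collectMetrics(ids)) // ignored: flag is set
    session.collectMetrics(ids)                           // counted once
    session.observedSum
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Restoring the flag in a `finally` block is a design choice of this sketch, so that a failed sampling job cannot leave later jobs marked as sampling jobs.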

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

