Hi,

Sorry if I am being noisy, but I wanted to draw your attention to
SPARK-50994 <https://issues.apache.org/jira/browse/SPARK-50994>.
It was raised because when a `Dataset` is converted into an `RDD`, it executes
the `SparkPlan` without any execution context. This leads to:

   1. No tracking is available on the Spark UI for the stages which are
   necessary to build the `RDD`.
   2. Spark properties that are local to the thread may not be set in the
   `RDD` execution context. As a result, these properties are not sent with
   the `TaskContext`, but some operations, like reading Parquet files, depend
   on them (e.g., case sensitivity).


#2 can lead to data correctness issues. See the test case added in the PR
<https://github.com/apache/spark/pull/49678>: the current version produces
incorrect values for the dedup operation.
I also think #1 is useful, since operations performed before the RDD
conversion are currently not traceable on the Spark UI.
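To illustrate the mechanism behind #2 without pulling in Spark itself, here is a
minimal plain-Scala analogy: a thread-local "session property" (standing in for
something like case sensitivity) is set on the calling thread, but a task run on
a fresh thread, without the caller's context captured, does not see it. The
names here are purely illustrative, not Spark APIs.

```scala
object LocalPropertyDemo {
  // Analogy for a thread-local Spark property (e.g. case sensitivity).
  // "unset" plays the role of the default the task falls back to.
  val caseSensitive: ThreadLocal[String] = new ThreadLocal[String] {
    override def initialValue(): String = "unset"
  }

  def main(args: Array[String]): Unit = {
    caseSensitive.set("false")

    // On the calling thread the property is visible.
    println(s"caller thread sees: ${caseSensitive.get()}")

    // On a fresh thread (analogous to executing the plan without the
    // caller's execution context) the property is lost and the default
    // is used instead — the shape of the correctness issue above.
    val worker = new Thread(() =>
      println(s"worker thread sees: ${caseSensitive.get()}"))
    worker.start()
    worker.join()
  }
}
```

In real Spark the fix is to capture and propagate the caller's execution
context when materializing the RDD, rather than letting the tasks fall back to
defaults.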

Thanks.
