Hi,

Sorry if I am being noisy, but I wanted to draw your attention to
SPARK-50994 <https://issues.apache.org/jira/browse/SPARK-50994>.
It was raised because when a `Dataset` is converted into an `RDD`, it executes
the `SparkPlan` without any execution context. This leads to:

   1. No tracking is available on the Spark UI for the stages which are
   necessary to build the `RDD`.
   2. Spark properties that are local to the thread may not be set in the
   `RDD` execution context. As a result, these properties are not sent with
   the `TaskContext`, but some operations, like reading Parquet files, depend
   on them (e.g., case sensitivity).


#2 can lead to data correctness issues. See the test case added in the PR
<https://github.com/apache/spark/pull/49678>: the current version produces
incorrect values for the dedup operation.
I also think #1 is useful, since operations performed before the RDD
conversion are currently not traceable on the Spark UI.
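To illustrate the mechanism behind #2 without pulling in Spark itself, here is a
minimal plain-Scala analogy: a thread-local "session property" (standing in for
something like case sensitivity) is set on the calling thread, but a task run on
a fresh thread, without the caller's context captured, does not see it. The
names here are purely illustrative, not Spark APIs.

```scala
object LocalPropertyDemo {
  // Analogy for a thread-local Spark property (e.g. case sensitivity).
  // "unset" plays the role of the default the task falls back to.
  val caseSensitive: ThreadLocal[String] = new ThreadLocal[String] {
    override def initialValue(): String = "unset"
  }

  def main(args: Array[String]): Unit = {
    caseSensitive.set("false")

    // On the calling thread the property is visible.
    println(s"caller thread sees: ${caseSensitive.get()}")

    // On a fresh thread (analogous to executing the plan without the
    // caller's execution context) the property is lost and the default
    // is used instead — the shape of the correctness issue above.
    val worker = new Thread(() =>
      println(s"worker thread sees: ${caseSensitive.get()}"))
    worker.start()
    worker.join()
  }
}
```

In real Spark the fix is to capture and propagate the caller's execution
context when materializing the RDD, rather than letting the tasks fall back to
defaults.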

Thanks.
