koertkuipers opened a new pull request #26560: [SPARK-29905] Reading of csv file fails with adaptive execution turned on
URL: https://github.com/apache/spark/pull/26560

### What changes were proposed in this pull request?

Switch to an RDD to avoid adaptive execution when reading the first line, because adaptive execution appears to insert a ShuffleMapStage, after which the first line read is not necessarily the CSV header anymore.

Note that this is a workaround. It is not clear to me why adaptive execution inserts a shuffle stage when all we are doing is filtering and reading the first line. If this is unwanted behavior, then the fix should be in adaptive execution.

### Why are the changes needed?

Without this change, Spark can read CSV files incorrectly when adaptive execution is turned on, resulting in errors and/or data corruption.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

It is not clear to me how to reproduce the issue in a unit test, because so far it only shows up with 6+ executors and decent-sized files. This patch was only tested by rebuilding Spark, re-running the steps described in the JIRA, and confirming that the issue did not show up again.
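To illustrate the failure mode described above, here is a minimal pure-Python sketch. It is a toy model and not Spark code: `split_into_partitions`, `shuffle`, `first_non_empty`, and the length-based partition key are all hypothetical names and choices made for illustration. It only shows why "take the first non-empty line" stops returning the header once rows have been redistributed across partitions.

```python
def split_into_partitions(lines, num_partitions):
    """Contiguous, order-preserving split, similar to reading file splits in order."""
    size = -(-len(lines) // num_partitions)  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def shuffle(partitions, num_partitions):
    """Redistribute rows by a toy key (row length), standing in for a hash partitioner.

    After this step, partition order no longer reflects file order.
    """
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for row in part:
            out[len(row) % num_partitions].append(row)
    return out


def first_non_empty(partitions):
    """Return the first non-blank row, scanning partitions in index order."""
    for part in partitions:
        for row in part:
            if row.strip():
                return row
    return None


# A small CSV: one header line followed by 20 data rows.
lines = ["id,name"] + [f"{i},user{i}" for i in range(1, 21)]

ordered = split_into_partitions(lines, 4)
shuffled = shuffle(ordered, 4)

print(first_non_empty(ordered))   # prints "id,name" (the real header)
print(first_non_empty(shuffled))  # prints "10,user10" (a data row, not the header)
```

In this toy model, no rows are lost by the shuffle; only their order changes. That matches the symptom reported in the PR: the file is read, but the wrong line is treated as the header.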
