koertkuipers opened a new pull request #26560: [SPARK-29905] Reading of csv 
file fails with adaptive execution turned on
URL: https://github.com/apache/spark/pull/26560
 
 
   ### What changes were proposed in this pull request?
   Switch to an RDD when reading the first line so that adaptive execution is 
not applied. It seems adaptive execution can insert a ShuffleMapStage, after 
which the first line read is not necessarily the CSV header anymore.
   
   Note that this is a workaround. It is not clear to me why adaptive execution 
inserts a shuffle stage when all we are doing is filtering and reading the first 
line. If this is unwanted behavior, then the fix should be in adaptive execution.
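
   To illustrate why a shuffle can break the "first line is the header" 
assumption, here is a minimal toy model in plain Python. It is not Spark code 
and does not reproduce adaptive execution. The names `first_line` and 
`round_robin_shuffle` are hypothetical, and the round-robin redistribution with 
an arbitrary starting offset is an assumption chosen only to show that a 
shuffle need not preserve file order.

```python
from typing import List, Optional

Partition = List[str]


def first_line(partitions: List[Partition]) -> Optional[str]:
    """Return the first row of the first non-empty partition (a take(1)-style read)."""
    for part in partitions:
        if part:
            return part[0]
    return None


def round_robin_shuffle(
    partitions: List[Partition], num_out: int, start: int
) -> List[Partition]:
    """Toy shuffle (assumption): deal rows round-robin into num_out partitions,
    beginning at output partition `start`."""
    out: List[Partition] = [[] for _ in range(num_out)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[(start + i) % num_out].append(row)
    return out


# Two input partitions of one CSV file; the header is the first row of the first one.
file_partitions = [["id,name", "1,a", "2,b"], ["3,c", "4,d", "5,e"]]
shuffled = round_robin_shuffle(file_partitions, num_out=3, start=1)

print(first_line(file_partitions))  # id,name
print(first_line(shuffled))         # 2,b
```

   Without the shuffle, the first row is the header. After the toy shuffle, all 
rows are still present, but the first row of the first partition is a data row.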
   
   ### Why are the changes needed?
   
   Without this change, Spark can read CSV files incorrectly when adaptive 
execution is turned on, resulting in errors and/or data corruption.
   
   ### Does this PR introduce any user-facing change?
   
   No
   
   ### How was this patch tested?
   
   It is not clear to me how to reproduce the issue in a unit test, because so 
far the issue only shows up with 6+ executors and decent-sized files. This patch 
was only tested by rebuilding Spark and re-running the steps described in the 
JIRA, and confirming that the issue did not show up again.
