HeartSaVioR commented on PR #49654: URL: https://github.com/apache/spark/pull/49654#issuecomment-2648023210
I've finally found time to play with this. Sorry, I don't deal with this kind of setup in my daily job.

I see that the default setup from the Spark distribution also sets the CWD differently between the driver and the workers. The metadata files contain the path correctly (even though they are stored outside of it), but Spark's file reader still can't read the data files properly. I also checked quickly with a batch query and saw that the batch file writer writes files into the driver's path. I have yet to read the code path for batch, but the correction probably happens when the temp file is moved to the final path.

I'll take a look at the PR, but I'd like to be conservative about behavioral changes. Spark is a project that many companies and individuals rely on to make revenue, and many of them have claimed "bug as a spec" when an upgrade broke their workload.

Could you please add a sink option in FileStreamSink, like "doNotQualifyRelativePathInDriver", defaulting to "false"? I'd also like to see this option covered in the SS doc. Please include the option in docs/streaming/apis-on-dataframes-and-datasets.md. Thanks!
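The CWD mismatch above can be sketched with a tiny example. This is not Spark's actual resolution logic; the `qualify` helper and the directory paths are hypothetical, purely to illustrate why a relative sink path resolves to different locations on the driver and on a worker:

```python
import os.path

def qualify(path: str, cwd: str) -> str:
    """Resolve a relative path against a process's working directory.

    Illustrative only: stands in for the path qualification that each
    Spark process effectively performs against its own CWD.
    """
    return path if os.path.isabs(path) else os.path.join(cwd, path)

# Hypothetical working directories for the two processes.
driver_cwd = "/opt/spark/work/driver"
worker_cwd = "/opt/spark/work/worker-0"

# The same relative sink path lands in two different places:
print(qualify("output/data", driver_cwd))  # /opt/spark/work/driver/output/data
print(qualify("output/data", worker_cwd))  # /opt/spark/work/worker-0/output/data

# An absolute path is unaffected, which is why qualifying the path once
# (e.g. on the driver) avoids the mismatch:
print(qualify("/data/output", worker_cwd))  # /data/output
```

This is also why the proposed option matters: whether the driver qualifies the relative path up front determines which of these locations every process agrees on.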
