HeartSaVioR commented on PR #49654: URL: https://github.com/apache/spark/pull/49654#issuecomment-2648023210
I've finally found time to play with this. Sorry, I don't deal with this kind of setup in my daily job.

I see that the default setup from the Spark distribution also sets the CWD differently between the driver and the workers. The metadata files contain the path correctly (even though they are stored outside of it), but Spark's file reader still can't read the data files properly. I also checked quickly with a batch query and saw that the batch file writer writes files into the driver's path. I have yet to read the code path for batch, but the correction probably happens when the temp file is moved to the final path.

I'll take a look at the PR, but I'd like to be conservative about behavioral changes. Spark is a project that many companies and individuals rely on to make revenue, and many of them have claimed "bug as a spec" when an upgrade broke their workload.

Could you please add a sink option in FileStreamSink, like "doNotQualifyRelativePathInDriver", defaulting to "false"? I'd also like to see this option covered in the SS doc. Please include the option in docs/streaming/apis-on-dataframes-and-datasets.md. Thanks!
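The CWD mismatch above can be sketched with a tiny example. This is not Spark's actual resolution logic; the `qualify` helper and the directory paths are hypothetical, purely to illustrate why a relative sink path resolves to different locations on the driver and on a worker:

```python
import os.path

def qualify(path: str, cwd: str) -> str:
    """Resolve a relative path against a process's working directory.

    Illustrative only: stands in for the path qualification that each
    Spark process effectively performs against its own CWD.
    """
    return path if os.path.isabs(path) else os.path.join(cwd, path)

# Hypothetical working directories for the two processes.
driver_cwd = "/opt/spark/work/driver"
worker_cwd = "/opt/spark/work/worker-0"

# The same relative sink path lands in two different places:
print(qualify("output/data", driver_cwd))  # /opt/spark/work/driver/output/data
print(qualify("output/data", worker_cwd))  # /opt/spark/work/worker-0/output/data

# An absolute path is unaffected, which is why qualifying the path once
# (e.g. on the driver) avoids the mismatch:
print(qualify("/data/output", worker_cwd))  # /data/output
```

This is also why the proposed option matters: whether the driver qualifies the relative path up front determines which of these locations every process agrees on.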
