steveloughran commented on issue #25863: [SPARK-28945][SPARK-29037][CORE][SQL] 
Fix the issue that spark gives duplicate result and support concurrent file 
source write operations write to different partitions in the same table.
URL: https://github.com/apache/spark/pull/25863#issuecomment-536576759
 
 
   I've looked at the code a bit more. As I noted earlier, this scares me.
   
In the FileOutputCommitter, the FS itself is the synchronization point, with assumptions about atomicity and performance implicitly built into the code. The applications which use the committer have their own assumptions about atomicity and performance, derived transitively from those of the file system and then extended by assumptions about the correctness of the algorithms.
   
   Things are bad enough as they are.
   
   I am not convinced that relying on the internals of the FileOutputCommitter algorithm versions is a safe way to do this.
   
   I think you'd be better off specifying a commit protocol which is explicitly designed for writing files into the destination, and then implementing it. For S3A, knowing the destination path lets us initiate but not complete the upload to the final path; we would then propagate the information needed to manifest that file to the job committer. The "Magic" committer does exactly this: it recognises that a write to `dest/__magic/$job_attemptID/$task-attempt-id/__basepath/year=2019/month=10/day=1/part-0001.csv` is to have a final destination of `dest/year=2019/month=10/day=1/part-0001.csv`, and the output stream, rather than creating that file on close(), only saves the metadata for the task commit to find.
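   To make that mapping concrete, here is a minimal sketch of the path rewrite, assuming the layout in the example above; the `MagicPathMapping` object, the marker handling, and the concrete job/task ids are illustrative, not the actual S3A code.
   
   ```scala
   object MagicPathMapping {
     private val Magic = "__magic"
     private val Base = "__basepath"
   
     /** Strip the magic markers: dest/__magic/<job>/<task>/__basepath/rel -> dest/rel. */
     def finalDestination(path: String): String = {
       val segments = path.split('/')
       val magicIdx = segments.indexOf(Magic)
       val baseIdx = segments.indexOf(Base)
       require(magicIdx >= 0 && baseIdx > magicIdx, s"not a magic path: $path")
       // Keep everything before __magic (the job destination) and everything
       // after __basepath (the file's path relative to that destination).
       (segments.take(magicIdx) ++ segments.drop(baseIdx + 1)).mkString("/")
     }
   
     def main(args: Array[String]): Unit = {
       val p = "dest/__magic/job_01/task_01_0/__basepath/year=2019/month=10/day=1/part-0001.csv"
       println(finalDestination(p)) // dest/year=2019/month=10/day=1/part-0001.csv
     }
   }
   ```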
   
   That is very much black magic in the FS connector. Having a commit protocol where you could ask the committer for an output path to use when writing to a specific destination would let you eliminate that trick completely, and possibly help with this problem.
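   As a sketch of what I mean, purely hypothetical and not an existing Spark or Hadoop interface: the writer asks the committer where to write for a given final destination, and the committer hands back a work path it knows how to commit later.
   
   ```scala
   // Hypothetical protocol; every name here is an assumption for illustration.
   trait PathMappingCommitter {
     /** Path the task should actually write to for the given final destination. */
     def workPath(finalDest: String, taskAttemptId: String): String
   
     /** Note that the file at the work path is ready for task/job commit. */
     def fileReady(finalDest: String, workPath: String): Unit
   }
   ```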
   
   The other thing to consider is that Spark permits committed tasks to pass serialized data back to the driver, as an alternative to relying on the file system. Scale issues notwithstanding, task committers should be able to provide the information needed to include the task in the job's final output. For example, rather than have task commit rename files, it could just return a path to where it has stored its list of files to commit. Job commit then becomes a matter of loading those files and moving the output into the final destination. Again, this is essentially what we do with the S3A committers: we just save, propagate and reload those files within the committer.
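   A sketch of that manifest pattern, loosely modelled on what the S3A committers do internally; the `ManifestCommitter` class and its method names here are illustrative assumptions, not Spark's commit protocol API.
   
   ```scala
   import java.nio.file.{Files, Paths}
   import scala.io.Source
   
   // Illustrative only: task commit writes a manifest listing the files it
   // produced and returns its location; job commit reloads every manifest.
   case class TaskManifest(manifestPath: String)
   
   class ManifestCommitter {
     // Task side: persist the list of files to commit, then hand the pointer
     // back to the driver instead of renaming anything.
     def commitTask(filesWritten: Seq[String], manifestPath: String): TaskManifest = {
       Files.write(Paths.get(manifestPath),
         filesWritten.mkString("\n").getBytes("UTF-8"))
       TaskManifest(manifestPath)
     }
   
     // Driver side: load each manifest; a real job commit would then move
     // every listed file into its final destination.
     def listFilesToCommit(manifests: Seq[TaskManifest]): Seq[String] =
       manifests.flatMap { m =>
         val src = Source.fromFile(m.manifestPath)
         try src.getLines().toList finally src.close()
       }
   }
   ```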
   
   This all gets complicated fast, even without worrying about concurrent jobs. But trying to use `FileOutputCommitter` internals to do what you want to do here strikes me as very, very dangerous. I already have issues with one of the commit algorithms implemented in it; trying to use it for other purposes is only going to make things worse.
