AngersZhuuuu commented on pull request #33828:
URL: https://github.com/apache/spark/pull/33828#issuecomment-1066479723
> * propose using `spark.sql.sources.writeJobUUID` as the job id when set;
more uniqueness and it should be set everywhere.
Right now every place uses Spark's job id. I can make this change after this
PR, since it is a separate concern.
> * core design looks ok. but i don't see why you couldn't support
concurrent jobs just by having different subdirs of __temporary for different
job IDs/UUIDs, and an option to disable cleanup. (and instructions to do it
later, which you'd need to do anyway).
Because if two jobs write to different partitions of the same table, they
share the same output path ${table_location}/temporary/0....
If one job succeeds, it deletes that path, and the other job's data is
lost.
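
To make the collision concrete, here is a minimal local-filesystem simulation in Python. It is only a sketch of the scenario described above, not FileOutputCommitter or Spark code: the `TEMP_DIR` name, the `stage`/`commit` helpers, the `cleanup_scope` switch, and the `job-a`/`job-b` subdirectory names are all assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed staging directory name, for illustration only.
TEMP_DIR = "_temporary"


def stage(table, job_subdir, partition):
    """Write one staged file under <table>/<TEMP_DIR>/<job_subdir>/<partition>."""
    out = Path(table) / TEMP_DIR / job_subdir / partition / "part-00000"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("rows for " + partition)
    return out


def commit(table, job_subdir, partition, cleanup_scope):
    """Move the staged partition into the table, then clean up.

    cleanup_scope="root" deletes the whole staging root (shared by all jobs);
    cleanup_scope="job" deletes only this job's own subdirectory.
    """
    staging_root = Path(table) / TEMP_DIR
    job_dir = staging_root / job_subdir
    shutil.move(str(job_dir / partition), str(Path(table) / partition))
    shutil.rmtree(staging_root if cleanup_scope == "root" else job_dir)


def second_job_survives(subdir_a, subdir_b, cleanup_scope):
    """Stage two concurrent jobs, commit the first one, and report whether
    the second job's staged file still exists afterwards."""
    table = tempfile.mkdtemp()
    try:
        stage(table, subdir_a, "dt=1")
        staged_b = stage(table, subdir_b, "dt=2")
        commit(table, subdir_a, "dt=1", cleanup_scope)
        return staged_b.exists()
    finally:
        shutil.rmtree(table, ignore_errors=True)


if __name__ == "__main__":
    # Both jobs share subdir "0" and cleanup removes the staging root.
    print(second_job_survives("0", "0", "root"))
    # Hypothetical per-job subdirs, cleanup limited to the job's own dir.
    print(second_job_survives("job-a", "job-b", "job"))
```

In this toy model, separate subdirectories alone are not enough: the second job's staged data only survives when the first job's cleanup is also restricted to its own subdirectory (or deferred), which matches the "option to disable cleanup" part of the suggestion.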
> * because that use of `__temporary/0` on file output committer is only
because on a restart of the MR AM lets the committer use `__temporary/1`
(using app attempt number for the subdir) then moving the committed task data
from job attempt 0 to its own dir, so recover all existing work. spark doesn't
need that.
This happens because Spark still uses FileOutputCommitter, so that behavior
is kept. If we rewrite the commit protocol, we can avoid it.
> * it'd be good for you to try out my manifest committer against hdfs with
your workloads. it is designed to be a lot faster in job commit because all
listing of task output directory trees is done in task commit, and job commit
does everything in parallel (listing of manifests, loading of manifests,
creating dest dirs, file rename). some of the options you don't need for hdfs
(parallel delete of task attempt temp dirs), but I still expect a massive
speedup of job commit, though not as much as for stores where listing and
rename is slower.
Yeah, I will try this later. It's a very useful design and could reduce the
pressure on HDFS a lot. I need to check this with our HDFS team too.
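
For my own understanding, the idea quoted above (tree listing done in task commit, job commit working from manifests in parallel) can be sketched roughly like this on a local filesystem. This is not the manifest committer's code or API: the JSON manifest layout and the `task_commit`, `job_commit`, and `demo` names are hypothetical, and a thread pool stands in for whatever parallelism the real committer uses.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def task_commit(task_dir, manifest_dir, task_id):
    """List the task's output tree once and record it in a manifest file."""
    task_dir = Path(task_dir)
    files = sorted(
        p.relative_to(task_dir).as_posix()
        for p in task_dir.rglob("*")
        if p.is_file()
    )
    manifest = Path(manifest_dir) / f"{task_id}.json"
    manifest.write_text(json.dumps({"task_dir": str(task_dir), "files": files}))
    return manifest


def job_commit(manifest_dir, dest_dir, threads=4):
    """Load manifests, create destination dirs, and rename files in parallel.

    Only the manifest directory is listed here; task output trees are not.
    """
    dest_dir = Path(dest_dir)
    manifests = sorted(Path(manifest_dir).glob("*.json"))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        loaded = list(pool.map(lambda m: json.loads(m.read_text()), manifests))
        moves = [
            (Path(m["task_dir"]) / rel, dest_dir / rel)
            for m in loaded
            for rel in m["files"]
        ]
        parents = {dst.parent for _, dst in moves}
        list(pool.map(lambda d: d.mkdir(parents=True, exist_ok=True), parents))
        list(pool.map(lambda pair: os.rename(pair[0], pair[1]), moves))
    return sorted(dst.relative_to(dest_dir).as_posix() for _, dst in moves)


def demo():
    """Run two toy tasks, commit each, then commit the job."""
    root = Path(tempfile.mkdtemp())
    manifests = root / "manifests"
    manifests.mkdir()
    for task_id, part in [("task_0", "dt=1"), ("task_1", "dt=2")]:
        out = root / "attempts" / task_id / part / f"part-{task_id}"
        out.parent.mkdir(parents=True)
        out.write_text("rows")
        task_commit(root / "attempts" / task_id, manifests, task_id)
    return job_commit(manifests, root / "table")
```

Calling `demo()` returns the committed paths relative to the toy table directory. The point of the split is that the expensive recursive listing happens once per task, spread across the tasks, so job commit only has to read small manifest files and issue renames.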
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.