[GitHub] [spark] AngersZhuuuu commented on pull request #33828: [SPARK-36579][CORE][SQL] Make spark source stagingDir can be customized

GitBox Mon, 14 Mar 2022 00:57:55 -0700


AngersZhuuuu commented on pull request #33828:
URL: https://github.com/apache/spark/pull/33828#issuecomment-1066479723



   > * propose using `spark.sql.sources.writeJobUUID` as the job id when set; 
more uniqueness and it should be set everywhere.
   
   Right now every place uses Spark's job ID. I can make this change after this 
PR, since it is a separate concern.
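
To make the quoted proposal concrete, here is a minimal sketch of "use the write job UUID when set, otherwise fall back to Spark's job ID". Only the config key `spark.sql.sources.writeJobUUID` comes from the discussion; the helper name `resolve_job_id` and the plain-dict config are assumptions for illustration, not Spark code.

```python
# Config key named in the review comment.
WRITE_JOB_UUID_KEY = "spark.sql.sources.writeJobUUID"


def resolve_job_id(conf: dict, spark_job_id: int) -> str:
    """Hypothetical helper: prefer the write job UUID, else Spark's job ID."""
    job_uuid = conf.get(WRITE_JOB_UUID_KEY)
    # An unset or empty value falls back to the numeric Spark job ID.
    return job_uuid if job_uuid else str(spark_job_id)
```

With an empty config this returns the Spark job ID as a string; once the key is set, the UUID takes precedence.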
   
   > * core design looks ok. but i don't see why you couldn't support 
concurrent jobs just by having different subdirs of __temporary for different 
job IDs/UUIDs, and an option to disable cleanup. (and instructions to do it 
later, which you'd need to do anyway).
   
   Because if two jobs write to different partitions of the same table, they 
share the same output path ${table_location}/temporary/0....
   If one job succeeds, it deletes that path, and the other job's data is 
lost.
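
The collision can be reproduced with a toy model on a local filesystem. This is only a sketch under stated assumptions: `ToyJob`, `second_job_survives`, and the `_temporary` directory name are hypothetical stand-ins, not Spark's or Hadoop's committer API. It compares a shared staging subdirectory `0` with a per-job subdirectory named after a UUID.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


class ToyJob:
    """Hypothetical stand-in for a write job; not Spark's or Hadoop's API."""

    def __init__(self, table: Path, staging: Path):
        self.table = table
        self.staging = staging
        self.staged = []

    def write(self, partition: str, data: str) -> None:
        # Stage one file for the given partition under this job's staging dir.
        path = self.staging / partition / "part-00000"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(data)
        self.staged.append(path)

    def commit(self) -> None:
        # Move this job's own files into the table, then delete the staging dir.
        for path in self.staged:
            dest = self.table / path.relative_to(self.staging)
            dest.parent.mkdir(parents=True, exist_ok=True)
            path.rename(dest)
        shutil.rmtree(self.staging, ignore_errors=True)


def second_job_survives(shared_staging: bool) -> bool:
    """Return True if job B's staged file still exists after job A commits."""
    with tempfile.TemporaryDirectory() as root:
        table = Path(root) / "table"

        def staging_dir() -> Path:
            # Shared mode: both jobs use subdir "0"; otherwise one subdir per job.
            sub = "0" if shared_staging else uuid.uuid4().hex
            return table / "_temporary" / sub

        job_a = ToyJob(table, staging_dir())
        job_b = ToyJob(table, staging_dir())
        job_a.write("p=1", "a")
        job_b.write("p=2", "b")
        job_a.commit()  # cleanup removes job A's staging dir
        return job_b.staged[0].exists()
```

`second_job_survives(True)` returns `False` because job A's cleanup deletes the directory job B is still writing into, while `second_job_survives(False)` returns `True`.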
   
   > * the use of `__temporary/0` in the file output committer exists only 
because a restart of the MR AM lets the committer use `__temporary/1` 
(using the app attempt number for the subdir) and then move the committed task 
data from job attempt 0 to its own dir, so recovering all existing work. spark 
doesn't need that.
   
   This happens because Spark still uses FileOutputCommitter and so keeps this 
behavior. If we rewrite the commit protocol, we can avoid it.
   
   > * it'd be good for you to try out my manifest committer against hdfs with 
your workloads. it is designed to be a lot faster in job commit because all 
listing of task output directory trees is done in task commit, and job commit 
does everything in parallel (listing of manifests, loading of manifests, 
creating dest dirs, file rename). some of the options you don't need for hdfs 
(parallel delete of task attempt temp dirs), but I still expect a massive 
speedup of job commit, though not as much as for stores where listing and 
rename is slower.
   
   Yes, I will try this later. It's a very useful design and could reduce the 
load on HDFS a lot. I need to check this with our HDFS team too.
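
As a rough illustration of the idea described in the quoted comment (list each task's output once at task commit, then do the job commit work in parallel from manifests), here is a toy sketch on a local filesystem. It is not the manifest committer's code or file format; `task_commit`, `job_commit`, `demo`, and the JSON layout are all assumptions made for this example.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def task_commit(task_dir: Path, dest_root: Path, manifest_dir: Path) -> Path:
    """List the task's output tree once and record it in a manifest file."""
    entries = [
        {"src": str(f), "dest": str(dest_root / f.relative_to(task_dir))}
        for f in sorted(task_dir.rglob("*"))
        if f.is_file()
    ]
    manifest_dir.mkdir(parents=True, exist_ok=True)
    manifest = manifest_dir / f"{task_dir.name}.json"
    manifest.write_text(json.dumps(entries))
    return manifest


def job_commit(manifest_dir: Path, workers: int = 4) -> int:
    """Load manifests, create dest dirs, and rename files, each step in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        manifests = sorted(manifest_dir.glob("*.json"))
        loaded = pool.map(lambda m: json.loads(m.read_text()), manifests)
        entries = [e for batch in loaded for e in batch]
        dest_dirs = {Path(e["dest"]).parent for e in entries}
        # list(...) forces each parallel step to finish before the next starts.
        list(pool.map(lambda d: d.mkdir(parents=True, exist_ok=True), dest_dirs))
        list(pool.map(lambda e: Path(e["src"]).rename(e["dest"]), entries))
    return len(entries)


def demo() -> list:
    """Stage two tasks, commit them, and return the files found in the table."""
    with tempfile.TemporaryDirectory() as root:
        base = Path(root)
        for task, part in (("task_0", "p=1"), ("task_1", "p=2")):
            out = base / "staging" / task / part / "part-00000"
            out.parent.mkdir(parents=True)
            out.write_text(task)
            task_commit(base / "staging" / task, base / "table", base / "manifests")
        job_commit(base / "manifests")
        table = base / "table"
        return sorted(
            f.relative_to(table).as_posix() for f in table.rglob("*") if f.is_file()
        )
```

Job commit here never walks the task output trees itself; it only reads the manifests, which is the part that should take load off the filesystem during commit.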
   

