[
https://issues.apache.org/jira/browse/HADOOP-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933323#comment-16933323
]
Steve Loughran commented on HADOOP-16570:
-----------------------------------------
h2. Plan
h3. now:
* listPendingUploads lists all pendingset files, loads these JSON files (in
separate threads) to produce a single list of all files to commit or abort.
* commit/abort is done in the thread pool
* _SUCCESS file lists all the written files; gets used in tests to verify that
(a) the number of files is > 0 and (b) the filenames match the store state,
expected values, etc.
h3. proposed
* list all .pendingset files
* hand off load and commit/abort to the threads, so the number of actively
loaded files is limited to the number of active threads.
* limit the size of the _SUCCESS file to the first, say, 500 entries; a counter
field will be updated to give the final number. This will be enough for all
integration tests that use the file that I know of.
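The proposed hand-off could look something like the sketch below: one task per .pendingset file, submitted to a fixed-size pool, so at most pool-size files are deserialized in memory at any time. This is a minimal illustration, not the actual AbstractS3ACommitter code; {{loadAndCommit}} and {{commitAll}} are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class IncrementalCommit {

    // Hypothetical stand-in for loading one .pendingset JSON file and
    // committing each upload it lists; returns the number committed.
    static int loadAndCommit(String pendingsetPath) {
        return 1;
    }

    // Submit one load+commit task per pendingset path. The fixed pool size
    // bounds how many files are loaded into memory at once, which is the
    // point of moving the load into the worker threads.
    static int commitAll(List<String> pendingsetPaths, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new java.util.ArrayList<>();
            for (String path : pendingsetPaths) {
                futures.add(pool.submit(() -> loadAndCommit(path)));
            }
            int committed = 0;
            for (Future<Integer> f : futures) {
                committed += f.get();   // propagates the first failure
            }
            return committed;
        } finally {
            pool.shutdown();            // avoid leaking the pool's threads
        }
    }
}
```

Note the {{shutdown()}} in the finally block: the thread leak this JIRA describes is exactly what happens when a fixed pool like this is never shut down.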
h2. Troublespots
h3. Handling failure to load a file, i.e. rolling back the commit
* We currently encounter all failures to load any file before a single upload
has been committed. With incremental load and commit, that no longer holds:
the operation can fail partway through.
* we currently roll back failures to commit during the phase where we
complete the uploads, by deleting the files already committed. For that we
need the entire list of committed files.
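With incremental commits, that list has to be accumulated as the workers go, so a partway failure can still delete everything committed so far. A minimal sketch of the idea, with a hypothetical {{RollbackTracker}} (the delete call is passed in rather than tied to any store API):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

public class RollbackTracker {

    // Destination paths whose uploads have already been completed.
    // Concurrent, because worker threads record commits as they finish.
    private final Queue<String> committed = new ConcurrentLinkedQueue<>();

    void recordCommitted(String path) {
        committed.add(path);
    }

    // On a failure partway through an incremental commit, delete every file
    // committed so far. The delete operation is a stand-in for the actual
    // store call.
    void rollback(Consumer<String> delete) {
        String path;
        while ((path = committed.poll()) != null) {
            delete.accept(path);
        }
    }
}
```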
h3. Partition directory output
The partitioned committer uses the list of pending uploads to identify leaf
directories and apply its policy to them:
* the Fail policy must fail the commit before a single file has been written.
* the Replace policy deletes all the files in those directories.
* the Append policy doesn't care.
The only way to implement the same checks with incremental loads will be to do
an initial scan to build up the tree of leaf directories and then apply the
chosen policy.
Together this implies we'll probably have to do an initial preload scan of all
pending files, at least for that partitioned committer. It's the one where we
don't want to write a single file if there are problems, and we need to build
that tree up.
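That preload scan only needs the destination paths, not the full pendingset contents, so it stays cheap. A sketch of the leaf-directory collection, assuming simple slash-separated destination paths ({{leafDirectories}} is a hypothetical helper):

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PartitionScan {

    // From the destination paths listed across all pendingset files, collect
    // the set of leaf (parent) directories that the partitioned committer
    // must apply its Fail/Replace/Append policy to, before committing
    // anything.
    static Set<String> leafDirectories(List<String> destinationPaths) {
        Set<String> leaves = new TreeSet<>();
        for (String path : destinationPaths) {
            int slash = path.lastIndexOf('/');
            leaves.add(slash > 0 ? path.substring(0, slash) : "/");
        }
        return leaves;
    }
}
```

With the tree built up front, the Fail policy can still refuse to commit before any file is written, and Replace knows exactly which directories to purge.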
The other committers can react to failures during incremental commits more
simply:
1. Abort all pending MPUs under the output directory.
2. Delete all files under the output directory.
I'll have to think about how best to restructure the code to do this, but it is
possible.
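As a rough shape, the two-step cleanup for the non-partitioned committers might look like this; the {{Store}} interface here is a hypothetical abstraction, not the actual S3A or SDK API:

```java
import java.util.List;

public class FailureCleanup {

    // Hypothetical abstraction over the object store; the real code would
    // go through the S3A filesystem and the S3 multipart-upload APIs.
    interface Store {
        List<String> listPendingMultipartUploads(String dir);
        void abortMultipartUpload(String uploadId);
        void deleteRecursive(String dir);
    }

    // On a failure in an incremental commit:
    // 1. abort every pending MPU under the output directory;
    // 2. delete everything already written there.
    static void cleanup(Store store, String outputDir) {
        for (String uploadId : store.listPendingMultipartUploads(outputDir)) {
            store.abortMultipartUpload(uploadId);
        }
        store.deleteRecursive(outputDir);
    }
}
```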
> S3A committers leak threads/raises OOM on job/task commit at scale
> ------------------------------------------------------------------
>
> Key: HADOOP-16570
> URL: https://issues.apache.org/jira/browse/HADOOP-16570
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.2.0, 3.1.2
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
>
> The fixed size ThreadPool created in AbstractS3ACommitter doesn't get cleaned
> up at EOL; as a result you leak the no. of threads set in
> "fs.s3a.committer.threads"
> Not visible in MR/distcp jobs, but ultimately causes OOM on Spark