[
https://issues.apache.org/jira/browse/HADOOP-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933323#comment-16933323
]
Steve Loughran commented on HADOOP-16570:
-----------------------------------------
h2. Plan
h3. now:
* listPendingUploads lists all pendingset files, loads these JSON files (in
separate threads) to produce a single list of all files to commit or abort.
* commit/abort is done in the thread pool
* _SUCCESS file lists all the written files; gets used in tests to verify that
(a) the number of files is > 0 and (b) the filenames match the store state,
expected values, etc.
h3. proposed
* list all .pendingset files
* hand off load and commit/abort to the threads, so the number of actively
loaded files is limited to the number of active threads.
* limit the size of the _SUCCESS file to the first, say, 500 entries; a counter
field will be updated to give the final number. This will be enough for all
integration tests that use the file that I know of.
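The proposed hand-off could look something like the sketch below: one task per .pendingset file, submitted to a fixed-size pool, so at most pool-size files are deserialized in memory at any time. This is a minimal illustration, not the actual AbstractS3ACommitter code; {{loadAndCommit}} and {{commitAll}} are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class IncrementalCommit {

    // Hypothetical stand-in for loading one .pendingset JSON file and
    // committing each upload it lists; returns the number committed.
    static int loadAndCommit(String pendingsetPath) {
        return 1;
    }

    // Submit one load+commit task per pendingset path. The fixed pool size
    // bounds how many files are loaded into memory at once, which is the
    // point of moving the load into the worker threads.
    static int commitAll(List<String> pendingsetPaths, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new java.util.ArrayList<>();
            for (String path : pendingsetPaths) {
                futures.add(pool.submit(() -> loadAndCommit(path)));
            }
            int committed = 0;
            for (Future<Integer> f : futures) {
                committed += f.get();   // propagates the first failure
            }
            return committed;
        } finally {
            pool.shutdown();            // avoid leaking the pool's threads
        }
    }
}
```

Note the {{shutdown()}} in the finally block: the thread leak this JIRA describes is exactly what happens when a fixed pool like this is never shut down.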
h2. Troublespots
h3. Handling failure to load a file, i.e. rolling back the commit
* We currently encounter all failures to load any file before a single upload
has been committed. With incremental load and commit, that no longer holds:
the operation can fail partway through.
* we currently roll back failures to commit during the phase where we
complete the uploads, by deleting the files already committed. For that we
need the entire list of committed files.
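With incremental commits, that list has to be accumulated as the workers go, so a partway failure can still delete everything committed so far. A minimal sketch of the idea, with a hypothetical {{RollbackTracker}} (the delete call is passed in rather than tied to any store API):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

public class RollbackTracker {

    // Destination paths whose uploads have already been completed.
    // Concurrent, because worker threads record commits as they finish.
    private final Queue<String> committed = new ConcurrentLinkedQueue<>();

    void recordCommitted(String path) {
        committed.add(path);
    }

    // On a failure partway through an incremental commit, delete every file
    // committed so far. The delete operation is a stand-in for the actual
    // store call.
    void rollback(Consumer<String> delete) {
        String path;
        while ((path = committed.poll()) != null) {
            delete.accept(path);
        }
    }
}
```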
h3. Partition directory output
The partitioned committer uses the list of pending uploads to identify leaf
directories and apply its policy to them:
* the Fail policy must fail the commit before a single file has been written.
* the Replace policy deletes all the files in those directories.
* the Append policy doesn't care.
The only way to implement the same checks with incremental loads will be to do
an initial scan to build up the tree of leaf directories and then apply the
chosen policy.
Together this implies we'll probably have to do an initial preload scan of all
pending files, at least for that partitioned committer. It's the one where we
don't want to write a single file if there are problems, and we need to build
that tree up.
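That preload scan only needs the destination paths, not the full pendingset contents, so it stays cheap. A sketch of the leaf-directory collection, assuming simple slash-separated destination paths ({{leafDirectories}} is a hypothetical helper):

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PartitionScan {

    // From the destination paths listed across all pendingset files, collect
    // the set of leaf (parent) directories that the partitioned committer
    // must apply its Fail/Replace/Append policy to, before committing
    // anything.
    static Set<String> leafDirectories(List<String> destinationPaths) {
        Set<String> leaves = new TreeSet<>();
        for (String path : destinationPaths) {
            int slash = path.lastIndexOf('/');
            leaves.add(slash > 0 ? path.substring(0, slash) : "/");
        }
        return leaves;
    }
}
```

With the tree built up front, the Fail policy can still refuse to commit before any file is written, and Replace knows exactly which directories to purge.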
The other committers can react to failures during incremental commits more
simply:
1. Abort all pending MPUs under the output directory.
2. Delete all files under the output directory.
I'll have to think about how best to restructure the code to do this, but it is
possible.
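As a rough shape, the two-step cleanup for the non-partitioned committers might look like this; the {{Store}} interface here is a hypothetical abstraction, not the actual S3A or SDK API:

```java
import java.util.List;

public class FailureCleanup {

    // Hypothetical abstraction over the object store; the real code would
    // go through the S3A filesystem and the S3 multipart-upload APIs.
    interface Store {
        List<String> listPendingMultipartUploads(String dir);
        void abortMultipartUpload(String uploadId);
        void deleteRecursive(String dir);
    }

    // On a failure in an incremental commit:
    // 1. abort every pending MPU under the output directory;
    // 2. delete everything already written there.
    static void cleanup(Store store, String outputDir) {
        for (String uploadId : store.listPendingMultipartUploads(outputDir)) {
            store.abortMultipartUpload(uploadId);
        }
        store.deleteRecursive(outputDir);
    }
}
```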
> S3A committers leak threads/raises OOM on job/task commit at scale
> ------------------------------------------------------------------
>
> Key: HADOOP-16570
> URL: https://issues.apache.org/jira/browse/HADOOP-16570
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.2.0, 3.1.2
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
>
> The fixed size ThreadPool created in AbstractS3ACommitter doesn't get cleaned
> up at EOL; as a result you leak the no. of threads set in
> "fs.s3a.committer.threads"
> Not visible in MR/distcp jobs, but ultimately causes OOM on Spark