[ 
https://issues.apache.org/jira/browse/HADOOP-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-14028:
------------------------------------
    Attachment: HADOOP-14028-branch-2-001.patch

Patch 001

# Cleanup takes place in the block's {{close()}} call.
# Pass down the block index (for logging) and stream statistics (for counting and tests).
# The {{ITestS3AHugeFiles*}} tests verify that uploads release all blocks, by way of the statistics counters.
# The upload callables always call {{Block.close()}} in a finally clause; without this, buffers/files would have leaked. (A sketch of the pattern follows this list.)
# There is probably some logging at info level that could be cut back.
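
To illustrate point 4, here is a minimal sketch of the close-in-finally pattern, assuming hypothetical {{Block}}, {{UploadResult}}, and {{uploadBlock()}} names; the actual patch code differs:

{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical stand-ins for the S3A data block and upload result.
interface Block extends Closeable { }
class UploadResult { }

class BlockUploadCallable implements Callable<UploadResult> {
  private final Block block;  // buffer- or disk-backed block
  private final int index;    // passed down for logging

  BlockUploadCallable(Block block, int index) {
    this.block = block;
    this.index = index;
  }

  @Override
  public UploadResult call() throws IOException {
    try {
      return uploadBlock(block, index);   // hypothetical upload of one part
    } finally {
      // Always release the buffer/temp file, on success or failure;
      // skipping this step is how blocks leaked.
      block.close();
    }
  }

  private UploadResult uploadBlock(Block block, int index) throws IOException {
    // Real code would PUT the block's data here.
    throw new UnsupportedOperationException("sketch only");
  }
}
{code}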

I can't replicate the problem I referred to in HADOOP-13560, where the stream 
stayed active after block closure. I think this would happen if we used the 
transfer manager and didn't block for the uploads to complete (as the original 
S3AOutputStream does), but since single-part PUTs go direct and multiparts 
block for the end tag, all uploads appear to be synchronous. A sketch of the 
distinction is below.
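
To make the synchronous/asynchronous distinction concrete, here's a sketch 
against the AWS SDK v1 APIs; illustrative only, not patch code:

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;

import java.io.File;

class UploadModes {

  // Synchronous: putObject() does not return until the PUT completes,
  // so closing the block immediately afterwards is safe.
  static void directPut(AmazonS3 s3, String bucket, String key, File block) {
    s3.putObject(new PutObjectRequest(bucket, key, block));
  }

  // Asynchronous: upload() returns immediately; the block must stay open
  // until waitForCompletion() returns. Closing it earlier would be the
  // "stream active after block closure" scenario from HADOOP-13560.
  static void managedUpload(TransferManager tm, String bucket, String key,
      File block) throws InterruptedException {
    Upload upload = tm.upload(new PutObjectRequest(bucket, key, block));
    upload.waitForCompletion();
  }
}
{code}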

As noted, I'm not seeing the multipart uploads closing the output streams *at 
all*, at least not in the logs.

I want to apply this to branch-2.8 to see what happens with the older SDK.

> S3A block output streams don't clear temporary files
> ----------------------------------------------------
>
>                 Key: HADOOP-14028
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14028
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.0.0-alpha2
>         Environment: JDK 8 + ORC 1.3.0 + hadoop-aws 3.0.0-alpha2
>            Reporter: Seth Fitzsimmons
>            Assignee: Steve Loughran
>         Attachments: HADOOP-14028-branch-2-001.patch
>
>
> I have `fs.s3a.fast.upload` enabled with 3.0.0-alpha2 (it's exactly what I 
> was looking for after running into the same OOM problems) and don't see it 
> cleaning up the disk-cached blocks.
> I'm generating a ~50GB file on an instance with ~6GB free when the process 
> starts. My expectation is that local copies of the blocks would be deleted 
> after those parts finish uploading, but I'm seeing more than 15 blocks in 
> /tmp (and none of them have been deleted thus far).
> I see that DiskBlock deletes temporary files when closed, but is it closed 
> after individual blocks have finished uploading or when the entire file has 
> been fully written to the FS (full upload completed, including all parts)?
> As a temporary workaround to avoid running out of space, I'm listing files, 
> sorting by atime, and deleting anything older than the first 20: `ls -ut | 
> tail -n +21 | xargs rm`
> Steve Loughran says:
> > They should be deleted as soon as the upload completes; the close() call 
> > that the AWS httpclient makes on the input stream triggers the deletion. 
> > Though there aren't tests for it, as I recall.


