[ https://issues.apache.org/jira/browse/TEZ-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586993#comment-15586993 ]
Rajesh Balamohan commented on TEZ-3478:
---------------------------------------

Haven't checked the ordered case yet, but the issue should be present there as well. Created this ticket to handle cleanup of unordered data; will create a subsequent jira for the ordered case.

> Cleanup fetcher data for failing task attempts (Unordered fetcher)
> ------------------------------------------------------------------
>
>                 Key: TEZ-3478
>                 URL: https://issues.apache.org/jira/browse/TEZ-3478
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>
> Env: 3-node AWS cluster with the entire dataset in S3. Since the data is in S3, there is no additional storage for HDFS (it uses the existing space available in the VMs). Tez version is 0.7.
> With some workloads (e.g. q29 in TPC-DS), unordered fetchers download data in parallel for different vertices and run out of disk space. However, the downloaded data related to these failed task attempts is not cleared, so subsequent task attempts encounter a similar situation and fail with a "No space" exception.
> e.g. stack trace
> {noformat}
> , errorMessage=Fetch failed:org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:261)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
>         at java.io.DataOutputStream.write(DataOutputStream.java:107)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:426)
>         at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:206)
>         at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:124)
>         at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:110)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
>         at java.io.DataOutputStream.write(DataOutputStream.java:107)
>         at org.apache.tez.runtime.library.common.shuffle.ShuffleUtils.shuffleToDisk(ShuffleUtils.java:146)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:771)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:497)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:396)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:195)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:70)
>         at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: No space left on device
>         at java.io.FileOutputStream.writeBytes(Native Method)
>         at java.io.FileOutputStream.write(FileOutputStream.java:345)
>         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSy
> {noformat}
> This would also affect any other job running in the cluster at the same time. It would be helpful to clean up the data downloaded for the failed task attempts.
> Creating this ticket mainly for the unordered fetcher case, though it could be a similar case for the ordered shuffle case as well.
> e.g. files
> {noformat}
> 17M  /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_0_10023_src_62_spill_-1.out
> 18M  /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_0_10023_src_63_spill_-1.out
> 16M  /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_0_10023_src_64_spill_-1.out
> ..
> ..
> 18M  /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_2_10003_src_0_spill_-1.out
> 17M  /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_2_10003_src_13_spill_-1.out
> 16M  /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_2_10003_src_15_spill_-1.out
> 16M  /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_2_10003_src_17_spill_-1.ou
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
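The cleanup the ticket asks for can be sketched roughly as follows. This is a hypothetical illustration, not the actual Tez patch: the class, method, and file names are invented, and only the spill-file naming pattern (`<attempt-id>_..._spill_-1.out`) is taken from the listing above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the requested behavior: when a task attempt's fetch
// fails, remove the spill files that attempt already wrote to local disk so
// they do not starve subsequent attempts (and other jobs) of space.
public class SpillCleanupSketch {

    // Delete every "<attemptPrefix>*_spill_*.out" file under localDir and
    // return how many were removed. Files from other attempts are untouched.
    static int cleanupSpills(Path localDir, String attemptPrefix) throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(localDir, attemptPrefix + "*_spill_*.out")) {
            for (Path p : stream) {
                if (Files.deleteIfExists(p)) {
                    deleted++;
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("appcache");
        // Mimic spill files left behind by a failed attempt (invented ids,
        // same naming shape as the JIRA listing).
        Files.createFile(dir.resolve("attempt_1_1_07_000028_0_10023_src_62_spill_-1.out"));
        Files.createFile(dir.resolve("attempt_1_1_07_000028_0_10023_src_63_spill_-1.out"));
        // A file belonging to a different attempt must survive the cleanup.
        Files.createFile(dir.resolve("attempt_1_1_07_000029_0_10024_src_1_spill_-1.out"));

        int removed = cleanupSpills(dir, "attempt_1_1_07_000028_0");
        System.out.println("removed=" + removed); // prints "removed=2"
    }
}
```

In the real fetcher this deletion would be driven from the fetch failure path rather than a directory scan; the sketch only shows that scoping the cleanup by attempt id keeps concurrent attempts' files safe.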