[jira] [Updated] (HDFS-15398) EC: hdfs client hangs due to exception during addBlock

2020-06-09 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15398:

Fix Version/s: 3.4.0
               3.3.1
 Hadoop Flags: Reviewed
   Resolution: Fixed
       Status: Resolved  (was: Patch Available)

> EC: hdfs client hangs due to exception during addBlock
> --
>
>              Key: HDFS-15398
>              URL: https://issues.apache.org/jira/browse/HDFS-15398
>          Project: Hadoop HDFS
>       Issue Type: Bug
>       Components: ec, hdfs-client
> Affects Versions: 3.2.0
>         Reporter: Hongbing Wang
>         Assignee: Hongbing Wang
>         Priority: Critical
>          Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15398.001.patch, HDFS-15398.002.patch, 
> HDFS-15398.003.patch, HDFS-15398.004.patch
>
>
> When writing an EC file, the client program hangs forever if its addBlock() 
> call for the second (or any later) block group fails because the space quota 
> is exceeded. See the demo below:
> {code:java}
> $ hadoop fs -mkdir -p /user/wanghongbing/quota/ec
> $ hdfs dfsadmin -setSpaceQuota 2g /user/wanghongbing/quota
> $ hdfs ec -setPolicy -path /user/wanghongbing/quota/ec -policy RS-6-3-1024k
> Set RS-6-3-1024k erasure coding policy on /user/wanghongbing/quota/ec
> $ hadoop fs -put 800m /user/wanghongbing/quota/ec
> ^@^@^@^@^@^@^@^@^Z
> {code}
> With blocksize=128M, spaceQuota=2g and the RS-6-3 EC policy, each block group 
> needs 1152M of physical space (9 × 128M) to hold 768M of logical data 
> (6 × 128M). Writing 800M of data therefore requires a second block group, and 
> allocating it brings the total to 2 × 1152M = 2304M, which exceeds the 2048M 
> quota. At this point the client hangs forever.
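The quota arithmetic above can be checked with a small sketch. It assumes, as the report describes, that each addBlock() call for a striped file is charged for a full block group, (dataUnits + parityUnits) × blockSize. The class and method names (`EcQuotaMath`, `firstGroupOverQuota`) are hypothetical, and this is not Hadoop's quota-accounting code.

```java
// Hypothetical sketch of the quota arithmetic in the report.
// Assumption: every block group is charged (dataUnits + parityUnits) * blockSize
// when it is allocated. This is not Hadoop's actual quota code.
public final class EcQuotaMath {
    static final long MB = 1024L * 1024L;

    /** Logical (user-visible) bytes that fit in one striped block group. */
    static long logicalBytesPerGroup(int dataUnits, long blockSize) {
        return dataUnits * blockSize;
    }

    /** Physical bytes charged for one block group (data plus parity blocks). */
    static long physicalBytesPerGroup(int dataUnits, int parityUnits, long blockSize) {
        return (dataUnits + parityUnits) * blockSize;
    }

    /** Block groups needed to hold fileSize logical bytes (ceiling division). */
    static long groupsNeeded(long fileSize, int dataUnits, long blockSize) {
        long perGroup = logicalBytesPerGroup(dataUnits, blockSize);
        return (fileSize + perGroup - 1) / perGroup;
    }

    /** 1-based index of the first group whose allocation exceeds the quota, or -1. */
    static long firstGroupOverQuota(long fileSize, int dataUnits, int parityUnits,
                                    long blockSize, long spaceQuota) {
        long groups = groupsNeeded(fileSize, dataUnits, blockSize);
        long physical = physicalBytesPerGroup(dataUnits, parityUnits, blockSize);
        for (long g = 1; g <= groups; g++) {
            // Cumulative physical space charged after allocating group g.
            if (g * physical > spaceQuota) {
                return g;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;
        long quota = 2048 * MB;   // spaceQuota=2g
        long fileSize = 800 * MB; // the 800m file from the demo
        System.out.println("logical per group:  " + logicalBytesPerGroup(6, blockSize) / MB + "M");
        System.out.println("physical per group: " + physicalBytesPerGroup(6, 3, blockSize) / MB + "M");
        System.out.println("groups needed:      " + groupsNeeded(fileSize, 6, blockSize));
        System.out.println("first group over quota: "
                + firstGroupOverQuota(fileSize, 6, 3, blockSize, quota));
    }
}
```

Running it should print 768M logical and 1152M physical per group, 2 groups needed for the 800M file, and group 2 as the first allocation that exceeds the 2g quota.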
> The stack of the hung client thread is as follows:
> {code:java}
> java.lang.Thread.State: TIMED_WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for  <0x8009d5d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>     at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
>     at org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.takeWithTimeout(DFSStripedOutputStream.java:117)
>     at org.apache.hadoop.hdfs.DFSStripedOutputStream.waitEndBlocks(DFSStripedOutputStream.java:453)
>     at org.apache.hadoop.hdfs.DFSStripedOutputStream.allocateNewBlock(DFSStripedOutputStream.java:477)
>     at org.apache.hadoop.hdfs.DFSStripedOutputStream.writeChunk(DFSStripedOutputStream.java:541)
>     - locked <0x8009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
>     at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217)
>     at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:164)
>     - locked <0x8009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
>     at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:145)
>     - locked <0x8009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
>     at org.apache.hadoop.hdfs.DFSStripedOutputStream.closeImpl(DFSStripedOutputStream.java:1182)
>     - locked <0x8009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
>     at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:847)
>     - locked <0x8009f758> (a org.apache.hadoop.hdfs.DFSStripedOutputStream)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
>     at org.apache.hadoop.io.IOUtils.cleanupWithLogger(IOUtils.java:280)
>     at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:298)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:77)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
>     at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:485)
>     at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:407)
>     at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:342)
>     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:277)
>     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:262)
> {code}
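The thread above is TIMED_WAITING inside a poll with a timeout, yet the client never returns. That suggests a loop that keeps re-polling after each timeout. The following is a minimal, hypothetical model of that pattern. The names `EndBlockWaiter`, `streamerHealthy`, `waitEndBlock` and `markFailed` are illustrative; this is not the real `DFSStripedOutputStream` code and not a description of the committed patch. A waiter polls a `LinkedBlockingQueue` with a timeout and retries while a health flag stays true. If nothing will ever enqueue an end block but the flag is never cleared, an unbounded version of this loop never exits. The `maxPolls` bound exists only so the demo terminates.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of a "poll with timeout, retry while healthy" loop.
// Not the real DFSStripedOutputStream code; all names are illustrative.
public final class EndBlockWaiter {
    private final LinkedBlockingQueue<String> endBlocks = new LinkedBlockingQueue<>();
    private final AtomicBoolean streamerHealthy = new AtomicBoolean(true);

    /** Producer side: a streamer reports the block it just finished. */
    void offerEndBlock(String block) {
        endBlocks.offer(block);
    }

    /** A failure path that marks the streamer unhealthy, letting waiters give up. */
    void markFailed() {
        streamerHealthy.set(false);
    }

    /**
     * Returns the next end block, or null if the streamer is unhealthy or
     * maxPolls timed polls all expired. Without the maxPolls bound, a healthy
     * streamer whose end block never arrives would keep this loop spinning forever.
     */
    String waitEndBlock(long timeoutMs, int maxPolls) {
        int polls = 0;
        try {
            while (streamerHealthy.get() && polls < maxPolls) {
                String b = endBlocks.poll(timeoutMs, TimeUnit.MILLISECONDS);
                if (b != null) {
                    return b;
                }
                polls++; // timed out: the unbounded version would simply retry
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
        }
        return null;
    }

    public static void main(String[] args) {
        EndBlockWaiter ok = new EndBlockWaiter();
        ok.offerEndBlock("blk_1");
        System.out.println("normal: " + ok.waitEndBlock(10, 5));

        // No end block arrives and nobody clears the flag: every poll times out,
        // and the loop stops only because of the demo-only maxPolls bound.
        EndBlockWaiter stuck = new EndBlockWaiter();
        System.out.println("stuck:  " + stuck.waitEndBlock(10, 5));

        // If a failure path marks the streamer unhealthy, the loop exits at once.
        EndBlockWaiter failed = new EndBlockWaiter();
        failed.markFailed();
        System.out.println("failed: " + failed.waitEndBlock(10, 5));
    }
}
```

In this model the hang depends on two conditions holding together: no producer will ever deliver the awaited item, and the condition guarding the retry loop stays true.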
> When an exception occurs in addBlock, the program will call 
> DFSStripedOutputStream.clo

[jira] [Updated] (HDFS-15398) EC: hdfs client hangs due to exception during addBlock

2020-06-09 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15398:

Summary: EC: hdfs client hangs due to exception during addBlock  (was: EC: 
hdfs client hangs when writing EC file occurs an addBlock exception)