+ Peter and hdfs-dev

Awesome, Siying! I am so impressed by the amount of work that you've done to improve HDFS I/O.
Could you please revert r23326, r23292 and r23290? I do not think that they will be useful for the warehouse use case. Thanks!

Hairong

From: Siying Dong <siyin...@fb.com>
Date: Mon, 5 Mar 2012 14:48:23 -0800
To: Dmytro Molkov <d...@fb.com>, Hairong Kuang <hair...@fb.com>, Guoqiang Jerry Chen <gqc...@fb.com>
Cc: Zheng Shao <zs...@fb.com>
Subject: RE: Datanodes optimizations

I cut svn+ssh://tubbs/svnhive/hadoop/branches/hive99-r19817-11012011-03022012, ported some changes, and deployed it (without the changes I ported today) to the test cluster last Friday. Here is the patch list I ported:

------------------------------------------------------------------------
r23326 | sdong | 2012-03-05 14:38:48 -0800 (Mon, 05 Mar 2012) | 20 lines

HDFS: Datanode adds received length for blocks being received

Summary: In order to make the visible length the real stable length for clients to use, datanodes should use the received length when doing block recovery.

Differential Revision: https://phabricator.fb.com/D417973
------------------------------------------------------------------------
r23325 | sdong | 2012-03-05 14:36:19 -0800 (Mon, 05 Mar 2012) | 24 lines

HDFS Datanode: replace FSDataset.setVisibleLength() in the read path with activeFile.setVisibleLength() to reduce acquiring FSDataset.lock

Summary: setVisibleLength() in the write pipeline may not need to grab locks. Instead, it can just hold the ActiveFile. To keep the recovery thread from modifying the object, modify the recovery thread code to make a copy of it instead of using the old one.

Differential Revision: https://phabricator.fb.com/D404130
------------------------------------------------------------------------
r23296 | sdong | 2012-03-02 18:18:26 -0800 (Fri, 02 Mar 2012) | 24 lines

HDFS Datanode: ignore the error when a block to delete is already not in memory

Summary: There is a possibility that, after a block report, the namenode sends a datanode a duplicate block to invalidate, one that was already sent previously. In that case, we should just ignore the exception and shouldn't do a checkDir after that.

ant TestDatanodeRestart
ant TestSimulatedDataset

Differential Revision: https://phabricator.fb.com/D419779
------------------------------------------------------------------------
r23295 | sdong | 2012-03-02 18:15:39 -0800 (Fri, 02 Mar 2012) | 34 lines

HDFS: Datanode starts to wait for the ack of a packet as soon as it is sent.

Summary: Currently, HDFS Datanodes do this sequence:
1. receive a packet
2. forward the packet
3. write the packet to file
4. start to wait for the ack of the packet, with a timeout

The timeout in step 4 is only slightly shorter than the client's timeout for the same ack. If step 3 takes 3+ seconds on DN1 and there is a real connectivity or load issue with the next node DN2 (which DN1 forwards the packet to), the client will time out first and declare DN1 bad, while leaving the really bad DN2 in the pipeline. This causes more permanent errors than necessary. This patch makes the datanode start waiting for the ack as soon as the packet is forwarded, to fix this issue.

Differential Revision: https://phabricator.fb.com/D406408
------------------------------------------------------------------------
r23294 | sdong | 2012-03-02 18:14:36 -0800 (Fri, 02 Mar 2012) | 25 lines

HDFS: eliminate locking in VolumeSet

Summary: This patch tries to eliminate acquiring the global lock to access the volume map and volumes. It is a risky change. I am making it first to see whether people are comfortable with it before moving forward.

Differential Revision: https://phabricator.fb.com/D398241
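A quick aside for hdfs-dev readers on the reordering described in r23295 above: the sketch below is only a toy model of a pipeline stage, not the actual BlockReceiver/responder code. The class and method names (PipelineStageSketch, onPacket, awaitNextAck, and the stubs) are made up for illustration; the point is simply where the ack deadline starts relative to the local disk write.

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Toy pipeline stage: a receiver thread forwards and writes packets, a responder
 *  thread waits for downstream acks. Per r23295, the ack deadline is anchored to
 *  the moment the packet was forwarded, not to when the local write finished. */
class PipelineStageSketch {

    private static final class Pending {
        final long seqno;
        final long forwardedAtNanos;
        Pending(long seqno, long forwardedAtNanos) {
            this.seqno = seqno;
            this.forwardedAtNanos = forwardedAtNanos;
        }
    }

    private final BlockingQueue<Pending> pendingAcks = new LinkedBlockingQueue<Pending>();
    private final long ackTimeoutNanos;

    PipelineStageSketch(long ackTimeoutMs) {
        this.ackTimeoutNanos = TimeUnit.MILLISECONDS.toNanos(ackTimeoutMs);
    }

    /** Receiver thread. The queue insert is moved up so the ack clock starts
     *  before the (possibly slow) local disk write. */
    void onPacket(long seqno, byte[] data) throws InterruptedException, IOException {
        forwardDownstream(seqno, data);
        pendingAcks.put(new Pending(seqno, System.nanoTime())); // previously this happened after writeLocally()
        writeLocally(seqno, data);
    }

    /** Responder thread: wait for the downstream ack of the oldest outstanding packet. */
    void awaitNextAck() throws InterruptedException, IOException {
        Pending p = pendingAcks.take();
        long remaining = ackTimeoutNanos - (System.nanoTime() - p.forwardedAtNanos);
        if (remaining <= 0 || !readAckFromDownstream(p.seqno, remaining)) {
            throw new IOException("timed out waiting for ack of packet " + p.seqno);
        }
    }

    // Stubs standing in for the real network and disk I/O.
    private void forwardDownstream(long seqno, byte[] data) throws IOException { }
    private void writeLocally(long seqno, byte[] data) throws IOException { }
    private boolean readAckFromDownstream(long seqno, long timeoutNanos) throws IOException { return true; }
}

With the queue insert placed before the write, a multi-second local write no longer delays the start of the downstream ack wait, so a genuinely slow downstream node times out here before the client gives up on this node.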
------------------------------------------------------------------------
r23293 | sdong | 2012-03-02 17:56:34 -0800 (Fri, 02 Mar 2012) | 21 lines

HDFS Datanode: isFinalized in read path doesn't require lock

Summary: This removes a lock from the read pipeline. Note that this function call was only recently added and is not in production yet.

Differential Revision: https://phabricator.fb.com/D406447
------------------------------------------------------------------------
r23292 | sdong | 2012-03-02 17:53:03 -0800 (Fri, 02 Mar 2012) | 27 lines

HDFS: Client to update available() for files under construction: part 1 - Data Protocol Change

Summary: This is the first part of the feature for clients to update available() for files under construction: the data protocol changes. The changes are: (1) send the length of the block at the end of the block, using a negative sign to indicate whether the block has been completed or not; (2) allow clients to ask for 0 bytes; if 0 bytes are asked for and no bytes are available, send -2 or -3; (3) datanodes call setVisible() when receiving acks, instead of after receiving the packet.

Differential Revision: https://phabricator.fb.com/D386945
------------------------------------------------------------------------
r23290 | sdong | 2012-03-02 17:43:48 -0800 (Fri, 02 Mar 2012) | 22 lines

HDFS BlockReceiver to update visibility of a block when the ack is received instead of when the packet is received

Summary: This change makes the visibility variable more reliable for clients to read. It also helps performance, as we move a global lock off the critical data thread.

Differential Revision: https://phabricator.fb.com/D389226
------------------------------------------------------------------------
r23289 | sdong | 2012-03-02 17:32:45 -0800 (Fri, 02 Mar 2012) | 27 lines

HDFS Client: only start to wait for a packet ack when there is something sent.

Summary: Since heartbeats are only guaranteed to be sent every client timeout/2, while the client started waiting for an ack as soon as it received the previous one, there is a good chance the client hits its timeout earlier than the first datanode does when the second or third datanode has a problem. The client will then falsely remove the first DN from the pipeline, which is wrong and causes more permanent failures than it should.

Differential Revision: https://phabricator.fb.com/D405345
------------------------------------------------------------------------
r23287 | sdong | 2012-03-02 17:30:09 -0800 (Fri, 02 Mar 2012) | 23 lines

HDFS BlockReceiver: do not check disk for ClosedChannelException when closing

Summary: BlockReceiver.close() can be called twice: from BlockReceiver.receiveBlock() and from DataWriter.writeBlock(). out.close() can throw an exception at the first call site, and the second time it is likely to throw ClosedChannelException. There is no reason to check the disk there. The patch skips disk checking for ClosedChannelException when CLOSING the streams.

Differential Revision: https://phabricator.fb.com/D415558
------------------------------------------------------------------------
r23286 | sdong | 2012-03-02 17:28:54 -0800 (Fri, 02 Mar 2012) | 21 lines

HDFS Datanode: avoid parallel disk checking

Summary: Parallel disk checking in datanodes doesn't help and significantly impacts performance and reliability. This patch avoids it by making the second caller return directly. I'll write a new unit test to cover DataNode.checkDiskError() to make sure at least it doesn't regress.

Differential Revision: https://phabricator.fb.com/D414853
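Side note on r23286: "making the second caller return directly" is essentially a single-flight guard around the disk check. Below is a minimal sketch of that pattern under my own naming; DiskCheckGuard is hypothetical, and the real change lives around DataNode.checkDiskError().

import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch of the "only one disk check at a time" guard described in r23286. */
class DiskCheckGuard {
    private final AtomicBoolean checking = new AtomicBoolean(false);

    /** If a check is already running, the second caller returns immediately
     *  instead of piling a parallel scan onto already-struggling disks. */
    void checkDisksIfIdle(Runnable actualCheck) {
        if (!checking.compareAndSet(false, true)) {
            return; // someone else is already checking; don't do it in parallel
        }
        try {
            actualCheck.run(); // walk the data directories, mark failed volumes, etc.
        } finally {
            checking.set(false);
        }
    }
}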
------------------------------------------------------------------------
r23285 | sdong | 2012-03-02 17:27:52 -0800 (Fri, 02 Mar 2012) | 21 lines

HDFS Datanode: no need to check disk for a close interrupted exception

Summary: In ScribeH clusters we experience massive checkDir calls for interrupted close(), which might cause problems and are unnecessary. Remove them.

Differential Revision: https://phabricator.fb.com/D411772
------------------------------------------------------------------------
r23284 | sdong | 2012-03-02 17:25:50 -0800 (Fri, 02 Mar 2012) | 34 lines

HDFS Datanode: not holding lock for FSDataset.getVisibleLength()

Summary: FSDataset.getVisibleLength() is called twice in the read pipeline, and it holds FSDataset.lock. Removing the lock will improve read latency in some cases.

It's safe to remove this locking. The function is only called twice, without the caller holding FSDataset.lock, when reading blocks. The only potential data race after removing the lock is another thread doing removeOngoingCreates() in the meantime. That is only called in:
1. finalize block. In this case it's safe to continue using activeFile.getVisibleLength(), as it will continue to be the right value.
2. unfinalized block. In this case the read has a risk of failing whether or not we hold the lock, so we don't care about it too much.
3. recover block. The recovery behavior is to move the same activeFile block to the new block generation. The object will continue to be the right object, and what we read from FSDataset.getVisibleLength() will be the same as either before or after the lock. Either way it should work as it does today (otherwise it's a bug even today, which I think actually exists).

Differential Revision: https://phabricator.fb.com/D406689
------------------------------------------------------------------------
r23283 | sdong | 2012-03-02 17:23:36 -0800 (Fri, 02 Mar 2012) | 29 lines

HDFS Datanode: remove locking when generating full block reports

Summary: This patch removes locking when generating full block reports. It makes several modifications:
1. Remove the lock.
2. Make FSDir.children fully immutable after initialization, by making it volatile and only setting it after the array is fully populated.
3. When reading FSDir.children, call getChildren(). I do this out of performance consideration, though I don't think it will make a difference to our performance; I just don't feel comfortable accessing volatile objects multiple times in a crucial code path. It also feels better to make the access pattern symmetric between read and write.

Differential Revision: https://phabricator.fb.com/D416111
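Side note on r23283, since the summary is terse: the safety argument is safe publication through a volatile field, i.e. build the children array completely, publish it once with a volatile write, and have readers take a single snapshot of the reference. A rough sketch of that pattern with stand-in names follows; DirSketch is not the real FSDir, and the traversal is only illustrative.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the FSDir idea in r23283: children is populated once, then
 *  published via a volatile write; block-report readers take one snapshot of
 *  the reference instead of re-reading the volatile field in a loop. */
class DirSketch {
    private volatile DirSketch[] children; // null until fully built

    void initChildren(File dir) {
        File[] subdirs = dir.listFiles();
        List<DirSketch> built = new ArrayList<DirSketch>();
        if (subdirs != null) {
            for (File f : subdirs) {
                if (f.isDirectory()) {
                    built.add(new DirSketch());
                }
            }
        }
        // Volatile write only after the array is fully populated, so any reader
        // that sees a non-null array sees a complete, effectively immutable one.
        this.children = built.toArray(new DirSketch[0]);
    }

    DirSketch[] getChildren() {
        return children; // single volatile read; callers iterate their own copy
    }

    /** Block-report-style traversal without holding the dataset lock. */
    int countDirs() {
        DirSketch[] snapshot = getChildren(); // read the volatile field once
        if (snapshot == null) {
            return 1;
        }
        int total = 1;
        for (DirSketch child : snapshot) {
            total += child.countDirs();
        }
        return total;
    }
}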
From: Dmytro Molkov
Sent: Friday, March 02, 2012 5:08 PM
To: Hairong Kuang; Siying Dong; Guoqiang Jerry Chen
Cc: Zheng Shao
Subject: RE: Datanodes optimizations

That will be awesome! Thanks for the update, will be waiting for the results of the rollout.

From: Hairong Kuang
Sent: Friday, March 02, 2012 5:00 PM
To: Dmytro Molkov; Siying Dong; Guoqiang Jerry Chen; Hairong Kuang
Cc: Zheng Shao
Subject: Re: Datanodes optimizations

Just chatted with the team regarding all the datanode fixes. Our plan is to cut yet another release candidate from our stable federation release. Siying will port all the fixes to the release candidate and run it on the dfstest cluster for a week. If no problems are found, we will roll it out to dfstemp the following week, and then hit dfssilver.

Hairong

From: Dmytro Molkov <d...@fb.com>
Date: Fri, 2 Mar 2012 16:33:02 -0800
To: Siying Dong <siyin...@fb.com>, Guoqiang Jerry Chen <gqc...@fb.com>, Hairong Kuang <hair...@fb.com>
Cc: Zheng Shao <zs...@fb.com>
Subject: RE: Datanodes optimizations

And when will the federation be fully rolled out?

From: Siying Dong
Sent: Friday, March 02, 2012 4:30 PM
To: Dmytro Molkov; Guoqiang Jerry Chen; Hairong Kuang
Cc: Zheng Shao
Subject: RE: Datanodes optimizations

For now, the plan is that after federation is successfully rolled out, we will slowly roll out datanodes again to deploy those fixes.

From: Dmytro Molkov
Sent: Friday, March 02, 2012 3:43 PM
To: Guoqiang Jerry Chen; Hairong Kuang; Siying Dong
Cc: Zheng Shao
Subject: Datanodes optimizations

Hey guys, today in the SLA meeting we were discussing the HDFS issues that were affecting the SLA jobs. The "All Datanodes Are Bad" error in particular was affecting us. The stuff that Siying is working on should help us a lot, but we need to figure out when all the fixes will make it into DFS1. Are there any timelines for the new release? Is there a task tracking the datanode work that would also track its deployment?