+ Peter and hdfs-dev

Awesome, Siying! I am so impressed by the amount of work that you've done to 
improve HDFS I/O.

Could you please revert r23326, r23292, and r23290? I do not think they will 
be useful for the warehouse use case.

Thanks!
Hairong

From: Siying Dong <siyin...@fb.com>
Date: Mon, 5 Mar 2012 14:48:23 -0800
To: Dmytro Molkov <d...@fb.com>, Hairong Kuang <hair...@fb.com>, Guoqiang Jerry Chen <gqc...@fb.com>
Cc: Zheng Shao <zs...@fb.com>
Subject: RE: Datanodes optimizations

I cut svn+ssh://tubbs/svnhive/hadoop/branches/hive99-r19817-11012011-03022012, 
ported some changes, and deployed it (without the changes I ported today) to the 
test cluster last Friday. Here is the list of patches I ported:


------------------------------------------------------------------------
r23326 | sdong | 2012-03-05 14:38:48 -0800 (Mon, 05 Mar 2012) | 20 lines


HDFS: Datanode adds received length for blocks being received

Summary:
In order to make the visible length a stable length that clients can rely on,
datanodes should use the received length when performing block recovery.




Differential Revision: https://phabricator.fb.com/D417973



------------------------------------------------------------------------
r23325 | sdong | 2012-03-05 14:36:19 -0800 (Mon, 05 Mar 2012) | 24 lines


HDFS Datanode: replace FSDataset.setVisibleLength() in the read path with
activeFile.setVisibleLength() to reduce acquisition of FSDataset.lock

Summary:
setVisibleLength() in the write pipeline may not need to grab the dataset lock;
it can simply hold on to the ActiveFile. To keep the recovery thread from
modifying that object, the recovery code is changed to make a copy of the
ActiveFile instead of using the old one (see the sketch below).





Differential Revision: https://phabricator.fb.com/D404130
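
Roughly, the idea looks like the sketch below. This is illustrative only: the
names are simplified and a volatile field stands in for whatever the real
ActiveFile uses; it is not the actual patch code.

    class ActiveFileSketch {
        private final java.io.File dataFile;
        private volatile long visibleLength;

        ActiveFileSketch(java.io.File dataFile, long visibleLength) {
            this.dataFile = dataFile;
            this.visibleLength = visibleLength;
        }

        // Copy constructor used by block recovery: recovery works on the copy,
        // so the ActiveFile held by the write pipeline is never mutated under it.
        ActiveFileSketch(ActiveFileSketch other) {
            this(other.dataFile, other.visibleLength);
        }

        // Called by the write pipeline directly on its own ActiveFile reference;
        // a single volatile write needs no FSDataset-wide lock.
        void setVisibleLength(long length) {
            this.visibleLength = length;
        }

        long getVisibleLength() {
            return visibleLength;
        }

        java.io.File getDataFile() {
            return dataFile;
        }
    }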



------------------------------------------------------------------------
r23296 | sdong | 2012-03-02 18:18:26 -0800 (Fri, 02 Mar 2012) | 24 lines



HDFS Datanode: ignore the error when a block to delete is already not in memory

Summary:
There is a possibility that, after a block report, the namenode sends a datanode
a duplicate request to invalidate a block that the datanode has already deleted.
In that case we should just ignore the exception and should not run a checkdir
afterwards (see the sketch below).

ant TestDatanodeRestart
ant TestSimulatedDataset




Differential Revision: https://phabricator.fb.com/D419779
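
The handling is essentially the following (an illustrative sketch with made-up
names, not the actual FSDataset code):

    import java.io.File;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class InvalidateSketch {
        private final Map<Long, File> volumeMap = new ConcurrentHashMap<>();

        void invalidate(long blockId) {
            File blockFile = volumeMap.remove(blockId);
            if (blockFile == null) {
                // Duplicate invalidate after a block report: the block is already
                // gone from memory. Just return; no exception is raised and, in
                // particular, no checkDiskError() is triggered.
                return;
            }
            blockFile.delete();
        }
    }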

------------------------------------------------------------------------
r23295 | sdong | 2012-03-02 18:15:39 -0800 (Fri, 02 Mar 2012) | 34 lines


HDFS: Datanode starts to wait for ack of a packet as soon as it is sent.

Summary:
Currently, HDFS datanodes follow this sequence:
1. receive a packet
2. forward the packet
3. write the packet to file
4. start to wait for the ack of the packet, with a timeout
The timeout in step 4 is only slightly shorter than the client's timeout for the
same ack. If step 3 takes 3+ seconds on DN1 and there is a real connectivity or
load problem with the next datanode DN2 (which DN1 forwards the packet to), the
client will time out first and decide DN1 is bad, leaving the genuinely bad DN2
in the pipeline. This causes more permanent errors than necessary.

This patch makes the datanode start waiting for the ack as soon as the packet is
forwarded, which fixes this issue (see the sketch below).





Differential Revision: https://phabricator.fb.com/D406408
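
The reordering looks roughly like this. It is a simplified sketch; the class and
method names are illustrative, not the real BlockReceiver/PacketResponder code.

    import java.io.IOException;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class PacketRelaySketch {
        static class Packet { long seqno; byte[] data; }

        private final BlockingQueue<Packet> ackQueue = new LinkedBlockingQueue<>();

        void receivePacket(Packet pkt) throws IOException {
            forwardToMirror(pkt);   // step 2: forward the packet downstream
            ackQueue.offer(pkt);    // hand off to the responder, so step 4 (waiting
                                    // for the ack) starts now
            writeToDisk(pkt);       // step 3: a slow local write no longer delays
                                    // the ack wait
        }

        // Responder thread: waits (with a timeout) for downstream acks of every
        // packet that has been handed to it.
        void responderLoop() throws InterruptedException {
            while (true) {
                Packet pkt = ackQueue.take();
                waitForDownstreamAck(pkt);
            }
        }

        private void forwardToMirror(Packet pkt) throws IOException { /* ... */ }
        private void writeToDisk(Packet pkt) throws IOException { /* ... */ }
        private void waitForDownstreamAck(Packet pkt) { /* ... */ }
    }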



------------------------------------------------------------------------
r23294 | sdong | 2012-03-02 18:14:36 -0800 (Fri, 02 Mar 2012) | 25 lines



HDFS: eliminate locking in VolumeSet

Summary:
This patch tries to eliminate repeatedly taking the global lock to access the
volume map and the volumes.
It is a risky change. I am putting it up first to see whether people are
comfortable with it before moving forward.





Differential Revision: https://phabricator.fb.com/D398241



------------------------------------------------------------------------
r23293 | sdong | 2012-03-02 17:56:34 -0800 (Fri, 02 Mar 2012) | 21 lines


HDFS Datanode: isFinalized in read path doesn't require lock

Summary:
This removes a lock acquisition from the read pipeline.
Note that this function call was only recently added and is not yet in
production.





Differential Revision: https://phabricator.fb.com/D406447


------------------------------------------------------------------------
r23292 | sdong | 2012-03-02 17:53:03 -0800 (Fri, 02 Mar 2012) | 27 lines


HDFS: Client to update available() for files under-construction: part1 - Data
Protocol Change

Summary:
This is the first part of the feature that lets clients update available() for
files under construction: the data protocol changes. The changes are:
(1) send the length of the block at the end of the block, using the sign to
indicate whether the block has been completed (a rough sketch of this idea is
below);
(2) allow clients to ask for 0 bytes; if 0 bytes are requested and no bytes are
available, send -2 or -3;
(3) datanodes call setVisible() when receiving acks, instead of after receiving
the packet.





Differential Revision: https://phabricator.fb.com/D386945
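
To illustrate point (1), here is one way a sign encoding can work. This is only
a sketch: the +1 bias and these helper names are mine, not the exact constants
or framing the patch uses.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    class BlockTrailerSketch {
        // A non-negative trailer means the block is complete; a negative trailer
        // means it is still under construction. The +1 bias keeps a zero-length
        // under-construction block on the negative side.
        static void writeTrailer(DataOutputStream out, long length, boolean complete)
                throws IOException {
            out.writeLong(complete ? length : -(length + 1));
        }

        static long readTrailer(DataInputStream in) throws IOException {
            return in.readLong();
        }

        static boolean isComplete(long trailer) {
            return trailer >= 0;
        }

        static long visibleLength(long trailer) {
            return trailer >= 0 ? trailer : -trailer - 1;
        }
    }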



------------------------------------------------------------------------
r23290 | sdong | 2012-03-02 17:43:48 -0800 (Fri, 02 Mar 2012) | 22 lines


HDFS BlockReceiver to update the visibility of a block when the ack is received
instead of when the packet is received

Summary:
This change could make the visibility variable more reliable for clients to
read. It also helps performance, since it moves a global lock acquisition off
the critical data thread.





Differential Revision: https://phabricator.fb.com/D389226


------------------------------------------------------------------------
r23289 | sdong | 2012-03-02 17:32:45 -0800 (Fri, 02 Mar 2012) | 27 lines


HDFS Client: only start to wait for packet ack when there is something sent.

Summary:
Since heartbeats are only guaranteed to be sent every client-timeout/2, while
the client starts waiting for an ack as soon as it has received the previous
one, there is a good chance that the client hits its timeout earlier than the
first datanode does when the second or third datanode has a problem. The client
then falsely removes the first DN from the pipeline, which is wrong and causes
more permanent failures than it should (see the sketch below).





Differential Revision: https://phabricator.fb.com/D405345
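
In other words, the client's response thread should block with no deadline while
nothing is outstanding, and only apply the ack timeout once a packet is actually
on the wire. A rough sketch (illustrative names and timeout value, not the
actual DFSClient code):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class ClientAckWaitSketch {
        static class Packet { long seqno; }

        private final BlockingQueue<Packet> ackQueue = new LinkedBlockingQueue<>();
        private final long ackTimeoutMs = 60_000L;   // illustrative value only

        // Data thread: a packet is enqueued only once it has really been sent.
        void sendPacket(Packet pkt) {
            writeToSocket(pkt);
            ackQueue.offer(pkt);
        }

        // Response thread: take() blocks without any deadline while the queue is
        // empty, so the ack timeout clock only starts once something was sent.
        void responseLoop() throws InterruptedException {
            while (true) {
                Packet pkt = ackQueue.take();
                if (!waitForAck(pkt, ackTimeoutMs)) {
                    handlePipelineError(pkt);
                }
            }
        }

        private void writeToSocket(Packet pkt) { /* ... */ }
        private boolean waitForAck(Packet pkt, long timeoutMs) { return true; } // placeholder
        private void handlePipelineError(Packet pkt) { /* ... */ }
    }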



------------------------------------------------------------------------
r23287 | sdong | 2012-03-02 17:30:09 -0800 (Fri, 02 Mar 2012) | 23 lines


HDFS BlockReceiver: not check disk for ClosedChannelException when closing

Summary:
BlockReceiver.close() can be called twice: from BlockReceiver.receiveBlock() and
from DataWriter.writeBlock(). out.close() can throw an exception in the first
call, and the second call is then likely to throw ClosedChannelException. There
is no reason to check the disk there.
The patch skips disk checking for ClosedChannelException when CLOSING the
streams (see the sketch below).




Differential Revision: https://phabricator.fb.com/D415558
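
The close path then looks roughly like this (a sketch only, simplified from the
real BlockReceiver.close() logic):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.channels.ClosedChannelException;

    class CloseSketch {
        void closeStream(OutputStream out) throws IOException {
            try {
                out.close();
            } catch (ClosedChannelException e) {
                // The stream was already closed by the earlier close() attempt.
                // This is not a disk problem, so do NOT trigger checkDiskError().
                throw e;
            } catch (IOException e) {
                checkDiskError();   // other I/O failures may indicate a bad disk
                throw e;
            }
        }

        private void checkDiskError() { /* ... */ }
    }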


------------------------------------------------------------------------
r23286 | sdong | 2012-03-02 17:28:54 -0800 (Fri, 02 Mar 2012) | 21 lines


HDFS Datanode: avoid parallel disk checking

Summary:
Parallel disk checking in datanodes doesn't help and significantly impacts
performance and reliability. This patch avoids it by making the second caller
return directly (see the sketch below).

I'll write a new unit test to cover DataNode.checkDiskError() to make sure it
at least doesn't regress.



Differential Revision: https://phabricator.fb.com/D414853
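
The guard is basically a compare-and-set flag. A minimal sketch, not the actual
DataNode.checkDiskError() code:

    import java.util.concurrent.atomic.AtomicBoolean;

    class DiskCheckGuardSketch {
        private final AtomicBoolean checkInProgress = new AtomicBoolean(false);

        void checkDiskError() {
            // If another thread is already running the disk check, return
            // immediately instead of starting a second, parallel scan.
            if (!checkInProgress.compareAndSet(false, true)) {
                return;
            }
            try {
                scanDataDirs();   // the expensive walk over all data directories
            } finally {
                checkInProgress.set(false);
            }
        }

        private void scanDataDirs() { /* ... */ }
    }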


------------------------------------------------------------------------
r23285 | sdong | 2012-03-02 17:27:52 -0800 (Fri, 02 Mar 2012) | 21 lines



HDFS Datanode: no need to check disk for close interrupted exception

Summary:
In ScribeH clusters we see massive checkdir runs triggered by interrupted
close() calls; they may cause problems and are unnecessary. Remove them.




Differential Revision: https://phabricator.fb.com/D411772



------------------------------------------------------------------------
r23284 | sdong | 2012-03-02 17:25:50 -0800 (Fri, 02 Mar 2012) | 34 lines


HDFS Datanode: not holding lock for FSDataSet.getVisibleLength()

Summary:
FSDataset.getVisibleLength() is called twice in the read pipeline, and it holds
FSDataset.lock. Removing the lock will improve read latency in some cases.

It's safe to remove this locking. The function is only called twice, without the
caller holding FSDataset.lock, when reading blocks.
The only potential data race after removing the lock is another thread calling
removeOngoingCreates() at the same time, which only happens when we:
1. finalize a block. In this case it is safe to keep using
activeFile.getVisibleLength(), as it will continue to be the right value.
2. unfinalize a block. In this case the read has a risk of failing anyway,
whether or not we hold the lock, so we don't worry about it too much.
3. recover a block. Recovery moves the same ActiveFile object to the new block
generation, so the object continues to be the right one, and what we read via
FSDataset.getVisibleLength() will be the same as if we had read it either right
before or right after holding the lock. Either way it should work as it does
today (otherwise it would already be a bug today, which I think actually
exists). See the sketch below.





Differential Revision: https://phabricator.fb.com/D406689
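
A sketch of the lock-free read, with illustrative maps and names; the real
FSDataset keeps more state than this:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class VisibleLengthSketch {
        static class ActiveFile {
            volatile long visibleLength;
            long getVisibleLength() { return visibleLength; }
        }

        private final Map<Long, ActiveFile> ongoingCreates = new ConcurrentHashMap<>();
        private final Map<Long, Long> finalizedLengths = new ConcurrentHashMap<>();

        // Read path: no dataset-wide lock. The lookup may race with
        // removeOngoingCreates(), but the ActiveFile object we get back remains a
        // valid source of the visible length (cases 1 and 3 above), and an
        // unfinalized block can fail regardless of locking (case 2).
        long getVisibleLength(long blockId) {
            ActiveFile af = ongoingCreates.get(blockId);
            if (af != null) {
                return af.getVisibleLength();
            }
            Long len = finalizedLengths.get(blockId);
            if (len == null) {
                throw new IllegalStateException("Unknown block " + blockId);
            }
            return len;
        }
    }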


------------------------------------------------------------------------
r23283 | sdong | 2012-03-02 17:23:36 -0800 (Fri, 02 Mar 2012) | 29 lines


HDFS Datanode: remove locking when generating full block reports

Summary:
This patch removes locking when generating full block reports. It makes several
modifications:
1. remove the lock;
2. make FSDir.children fully immutable after initialization, by making it
volatile and only setting it after the array is fully populated;
3. when reading FSDir.children, call getChildren(). I do this for performance
reasons, though I don't think it will make a difference to our performance; I
just don't feel comfortable accessing a volatile field multiple times in a
crucial code path, and it also feels better to keep the access pattern symmetric
between read and write.

See the sketch below.





Differential Revision: https://phabricator.fb.com/D416111
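
The publication pattern for point 2 and the single-read accessor for point 3
look roughly like this (a sketch only, not the actual FSDir code):

    import java.io.File;
    import java.util.List;

    class FSDirSketch {
        private final File dir;
        // Written exactly once, after the array is fully populated, so a reader
        // that sees a non-null reference always sees a complete, immutable array.
        private volatile FSDirSketch[] children;

        FSDirSketch(File dir) {
            this.dir = dir;
            File[] subdirs = dir.listFiles(File::isDirectory);
            if (subdirs != null && subdirs.length > 0) {
                FSDirSketch[] tmp = new FSDirSketch[subdirs.length];
                for (int i = 0; i < subdirs.length; i++) {
                    tmp[i] = new FSDirSketch(subdirs[i]);
                }
                children = tmp;        // single volatile publish of the full array
            }
        }

        FSDirSketch[] getChildren() {
            return children;           // one volatile read per traversal level
        }

        // Block-report scan: no dataset lock; each level reads the volatile
        // field once and then works on that snapshot.
        void scan(List<File> out) {
            out.add(dir);
            FSDirSketch[] snapshot = getChildren();
            if (snapshot == null) {
                return;
            }
            for (FSDirSketch child : snapshot) {
                child.scan(out);
            }
        }
    }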

From: Dmytro Molkov
Sent: Friday, March 02, 2012 5:08 PM
To: Hairong Kuang; Siying Dong; Guoqiang Jerry Chen
Cc: Zheng Shao
Subject: RE: Datanodes optimizations

That will be awesome! Thanks for the update, will be waiting for the results of 
the rollout.

From: Hairong Kuang
Sent: Friday, March 02, 2012 5:00 PM
To: Dmytro Molkov; Siying Dong; Guoqiang Jerry Chen; Hairong Kuang
Cc: Zheng Shao
Subject: Re: Datanodes optimizations

Just chatted with the team regarding all the datanode fixes. Our plan is to cut yet 
another release candidate from our stable federation release. Siying will port 
all the fixes to the release candidate and run it on the dfstest cluster for a 
week. If no problem is found, we will roll it to dfstemp in the following week. 
Then hit dfssilver.

Hairong

From: Dmytro Molkov <d...@fb.com>
Date: Fri, 2 Mar 2012 16:33:02 -0800
To: Siying Dong <siyin...@fb.com>, Guoqiang Jerry Chen <gqc...@fb.com>, Hairong Kuang <hair...@fb.com>
Cc: Zheng Shao <zs...@fb.com>
Subject: RE: Datanodes optimizations

And when will the federation be fully rolled out?

From: Siying Dong
Sent: Friday, March 02, 2012 4:30 PM
To: Dmytro Molkov; Guoqiang Jerry Chen; Hairong Kuang
Cc: Zheng Shao
Subject: RE: Datanodes optimizations

The current plan is that, after federation is successfully rolled out, we will 
slowly roll out datanodes again to deploy those fixes.

From: Dmytro Molkov
Sent: Friday, March 02, 2012 3:43 PM
To: Guoqiang Jerry Chen; Hairong Kuang; Siying Dong
Cc: Zheng Shao
Subject: Datanodes optimizations

Hey guys, today in the SLA meeting we were discussing the HDFS issues that were 
affecting the SLA jobs.
The "All Datanodes Are Bad" error in particular was affecting us. The work that 
Siying is doing should help us a lot, but we need to figure out when all the 
fixes will make it into DFS1.
Are there any timelines for the new release? Is there a task tracking the 
datanode work that would also track its deployment?
