Hi Mike,
Thanks for trying to help out.
I had a talk with our networking guys this afternoon. According to them (and
this is way out of my area of expertise, so excuse any mistakes), multiple
interfaces shouldn't be a problem. We could set up a nameserver that resolves
hostnames to addresses in our private space when the request comes from one of
the nodes, and route this traffic over a single interface. Any other request
gets resolved to an address in the public space, which is bound to another
interface. In our current setup we're not even resolving the hostnames in our
private address space through a nameserver - we do it with an ugly hack in
/etc/hosts (sketched below), and it seems to work all right.
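To make that concrete: the /etc/hosts hack is nothing more than pinning every
node's hostname to its private address on each machine, so all HDFS/MapReduce
traffic stays on the internal interface. Something like this (hostnames are
made up for illustration; ours differ):

  # /etc/hosts on every cluster node - illustration only
  192.168.28.210  node01.cluster.internal  node01
  192.168.28.211  node02.cluster.internal  node02
  192.168.28.214  node05.cluster.internal  node05

The nameserver variant would be split-horizon DNS (e.g. BIND views): queries
coming from the nodes' subnet get answers in the private range, everything
else gets the public addresses.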
Having said that, our problems are still not completely gone even after
adjusting the maximum allowed RAM for tasks, although things are a lot better.
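(For context: by "maximum allowed RAM for tasks" I mean the usual 0.20-era
mapred-site.xml knobs - the values below are only an illustration, not our
actual settings:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>          <!-- concurrent map slots per node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>          <!-- concurrent reduce slots per node -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>  <!-- heap ceiling per task JVM -->
  </property>

Slots times the per-task heap, plus the DN and TT daemons, has to stay well
under physical RAM, otherwise the nodes start swapping.)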
While writing this mail, three out of five DNs were marked as dead. There is
still some swapping going on, but the cores are not spending any time in WAIT,
so this shouldn't be the cause of anything (see the vmstat note below). Below
is a trace from a dead DN - any thoughts are appreciated!
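(For reference, by WAIT I mean the I/O-wait figure you get from tools like
vmstat or top. An illustrative vmstat run - the numbers are made up:

  $ vmstat 5
  procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
   2  0 524288  81920  12288 409600   12    8   120   340 1200 2400 85  9  5  1

Non-zero si/so means pages are actually moving to and from swap, while wa
shows how long the CPUs sit blocked on I/O.)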
Cheers,
Evert
2011-05-13 23:13:27,716 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Received block blk_-9131821326787012529_2915672 src: /192.168.28.211:60136
dest: /192.168.28.214:50050 of size 382425
2011-05-13 23:13:27,915 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Exception in receiveBlock for block blk_-9132067116195286882_130888
java.io.EOFException: while trying to read 3744913 bytes
2011-05-13 23:13:27,925 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.214:35139, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001437_0, offset: 196608, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_-9163184839986480695_4112368, duration: 6254000
2011-05-13 23:13:28,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Received block blk_-9149862728087355005_3793421 src: /192.168.28.210:41197
dest: /192.168.28.214:50050 of size 245767
2011-05-13 23:13:28,033 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
Block blk_-9132067116195286882_130888 unfinalized and removed.
2011-05-13 23:13:28,033 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
writeBlock blk_-9132067116195286882_130888 received exception
java.io.EOFException: while trying to read 3744913 bytes
2011-05-13 23:13:28,033 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.28.214:50050,
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 3744913 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,038 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.214:32910, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001443_0, offset: 197632, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_-9163184839986480695_4112368, duration: 4323000
2011-05-13 23:13:28,038 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.214:35138, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001440_0, offset: 197120, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_-9163184839986480695_4112368, duration: 5573000
2011-05-13 23:13:28,159 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.212:38574, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001444_0, offset: 197632, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_-9163184839986480695_4112368, duration: 16939000
2011-05-13 23:13:28,209 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Received block blk_-9123390874940601805_2898225 src: /192.168.28.210:44227
dest: /192.168.28.214:50050 of size 300441
2011-05-13 23:13:28,217 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.213:42364, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001451_0, offset: 198656, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_-9163184839986480695_4112368, duration: 5291000
2011-05-13 23:13:28,252 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.214:32930, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001436_0, offset: 0, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_-1800696633107072247_4099834, duration: 5099000
2011-05-13 23:13:28,256 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.213:42363, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001458_0, offset: 199680, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_-9163184839986480695_4112368, duration: 4945000
2011-05-13 23:13:28,257 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.214:35137, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001436_0, offset: 196608, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_-9163184839986480695_4112368, duration: 4159000
2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Exception in receiveBlock for block blk_-9140444589483291821_3585975
java.io.EOFException: while trying to read 100 bytes
2011-05-13 23:13:28,258 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
Block blk_-9140444589483291821_3585975 unfinalized and removed.
2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
writeBlock blk_-9140444589483291821_3585975 received exception
java.io.EOFException: while trying to read 100 bytes
2011-05-13 23:13:28,259 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.28.214:50050,
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 100 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,264 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.212:38553, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001441_0, offset: 0, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_-5819719631677148140_4098274, duration: 5625000
2011-05-13 23:13:28,264 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.212:38535, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001438_0, offset: 196608, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_-9163184839986480695_4112368, duration: 4473000
2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(192.168.28.214:50050,
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075,
ipcPort=50020): Exception writing block blk_-9150014886921014525_2267869 to
mirror 192.168.28.213:50050
java.io.IOException: The stream is closed
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:108)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:540)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,265 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/192.168.28.214:50050, dest: /192.168.28.213:45484, bytes: 0, op: HDFS_READ,
cliID: DFSClient_attempt_201105131125_0025_m_001432_0, offset: 0, srvID:
DS-443352839-145.100.2.183-50050-1291128673616, blockid:
blk_405051931214094755_4098504, duration: 5597000
2011-05-13 23:13:28,273 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Received block blk_-9150014886921014525_2267869 src: /192.168.28.211:49208
dest: /192.168.28.214:50050 of size 3033173
2011-05-13 23:13:28,313 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
Received block blk_-9144765354308563975_3310572 src: /192.168.28.211:51592
dest: /192.168.28.214:50050 of size 242383
________________________________________
From: Segel, Mike [[email protected]]
Sent: Friday, May 13, 2011 2:36 PM
To: [email protected]
Cc: <[email protected]>; <[email protected]>
Subject: Re: Stability issue - dead DN's
Bonded will work, but you may not see the performance you would expect. If you
need >1 GbE, go 10GbE - less headache and even more headroom.
Multiple interfaces won't work. Or I should say, they didn't work in past
releases. If you think about it, clients have to connect to each node, so
having two interfaces and trying to manage them makes no sense.
Add to this trying to manage it in DNS... Why make more work for yourself?
Going from memory... It looked like your rDNS had to match your hostnames, so
your internal interfaces had to carry the hostnames and you ended up with an
inverted network.
If you draw out your network topology you end up with a ladder.
You would be better off (IMHO) creating a subnet where only your edge servers
are dual-NIC'd.
But then if your cluster is for development... Now your PCs can't be used as
clients...
Does this make sense?
Sent from a remote device. Please excuse any typos...
Mike Segel
On May 13, 2011, at 4:57 AM, "Evert Lammerts" <[email protected]> wrote:
> Hi Mike,
>
>> You really really don't want to do this.
>> Long story short... It won't work.
>
> Can you elaborate? Are you talking about the bonded interfaces, or about
> having a separate network for interconnects and external traffic? What can
> go wrong there?
>
>>
>> Just a suggestion... You don't want anyone on your cluster itself. They
>> should interact with edge nodes, which are 'Hadoop aware'. Then your
>> cluster has a single network to worry about.
>
> That's our current setup. We have a single headnode that is used as a SPOE
> (single point of entry). However, I'd like to change that on our future
> production system. We want to implement Kerberos for authentication and let
> users interact with the cluster from their own machines, so that they can
> submit their jobs straight from their local IDE. My understanding is that
> the only way to do this is to open up the Hadoop ports to the world: if
> people interact with HDFS they need to be able to reach all nodes, right?
> What would be the argument against this?
>
> Cheers,
> Evert
>
>>
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 11, 2011, at 11:45 AM, Allen Wittenauer <[email protected]> wrote:
>>
>>>
>>>>> * a 2x1GE bonded network interface for interconnects
>>>>> * a 2x1GE bonded network interface for external access
>>>
>>> Multiple NICs on a box can sometimes cause big performance
>>> problems with Hadoop. So watch your traffic carefully.