I'm not 100% sure if this is the right forum, but admins feel free to move it 
if it's not.

The issue I have, in short, is that whenever I fire up the Hadoop datanode on the 
new osol system and start writing files to HDFS from the other nodes, the system 
works for a few minutes, writing files as it's supposed to, and then hangs. 
And I mean really hangs. I've got access to the remote console (it's a 
Supermicro MoBo with the BMC on a dedicated LAN) and it's unresponsive; I've also gone to 
the datacenter and verified with a keyboard and monitor that it's unresponsive. 

Now a bit of background. I started with osol snv_111b (2009.06) and the 
system worked just fine while I was using dCache for the storage virtualization 
and FDT for transfers. I transferred over 14TB in ca 30h to the locally set 
up zfs (a zpool of 2x raidz of 12 disks each) and all was fine. The problems started 
when I installed Hadoop and started to work in that environment. 

The machine itself is:
- 2 x Intel E5520 2266MHz Quad-Core Xeon DP
- Supermicro X8DTL-3F
- 2 x 6GB 1333MHz KVR1333D3E9SK3/6G
- Areca ARC-1280ML, 24-port SATA-II HW RAID, PCI-E
- Areca ARC-BBM Battery Backup Module
- 2GB DDR2 PC5300 667MHz ECC DIMM for the ARC-1280ML
- Intel quad-port Ethernet (e1000g driver)

So, to start debugging the odd behavior (btw, reading files from Hadoop by 
clients is just fine, no hang there), I first booted the machine with the -k -v 
options and once it hung I used F1-A to get into kmdb (boy, do I consider myself 
lucky now in hindsight) and initiated a panic and dump. I still have that dump 
if anyone has interest or ideas. Once it booted up again the system reported 
some Intel CPU errors, as can be seen here: http://pastebin.com/m2127ae0c
Going by this I had the manufacturer replace the CPUs today, but alas, that 
wasn't the cause (or I got a second batch of faulty ones, which I doubt). 
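
In case anyone wants to poke at that dump: forcing the panic from kmdb is the 
usual $<systemdump macro, and afterwards the saved dump can be loaded roughly 
like this (the .0 suffix is just whatever number savecore assigned):

  # cd /var/crash/`hostname`
  # mdb unix.0 vmcore.0
  > ::status
  > ::msgbuf
  > ::stack
  > ::cpuinfo -v

::status gives the panic string and dump details, ::msgbuf the last console 
messages, ::stack the panicking thread, and ::cpuinfo -v what each CPU was 
running at the time. The CPU errors reported after the reboot should also be 
visible on the FMA side with fmdump -eV.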

While I was waiting for the new CPUs I continued to debug, trying to watch 
mpstat, vmstat, iostat and so forth, but I didn't notice anything really out of 
the ordinary happening just before the hang. During the file transfers the 
interrupts go up to ca 12k, the system calls up to 60k and context switches to 
70k; then they fall down again and go up again as the blocks in Hadoop arrive. 
I've seen hangs during high-cs periods (mostly) and during low ones (a few 
times), so I didn't find a correlation there. 
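
(For reference, the monitoring was nothing fancy, just short-interval sampling 
in separate terminals, along the lines of:

  # mpstat 1
  # vmstat 1
  # iostat -xnz 1

i.e. per-CPU interrupts/syscalls/context switches, memory and paging, and 
per-disk I/O respectively.)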

However, one steady thing I discovered was that I could no longer break 
from the hang into kmdb. No matter how many times I tried F1-A, on however many 
hangs, I never got back into kmdb. I have been able to get into kmdb on four 
other occasions, but always from an idle shell, and even those were rare 
occurrences; in most cases F1-A doesn't seem to be doing anything. 

So I thought maybe it's a hard hang and added 

set snooping=1
set snoop_interval=90000000

into /etc/system. At the next hang I allowed the system some time, but nothing 
happened. In the first few attempts I didn't even have snoop_interval 
defined, so I assumed it would default to 50s, but even letting the system hang 
for a full night didn't change anything. 
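
(One sanity check, in case anyone suspects the tunables simply weren't applied: 
they can be read back from the running kernel after the reboot, e.g.

  # echo "snooping/D" | mdb -k
  # echo "snoop_interval/D" | mdb -k

with snoop_interval being in microseconds. If there's a better way to confirm 
the deadman is actually armed, I'm all ears.)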

So I am a bit at a loss on how to proceed, as obviously a user-level process 
(running as the hadoop user even, not root) should not be able to trash the 
system like this. I have in the meantime upgraded to snv_128a with no change in 
the behavior. I have not upgraded zfs, as I was considering that maybe if I 
reinstall the machine with Solaris 10 it will still be able to mount the zfs 
pools, just to try whether that changes anything. 
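
(Side note on the Solaris 10 idea: whether it can import the pools comes down 
to the on-disk zpool/zfs versions versus what that release supports, which is 
easy to check up front; the pool name below is just a placeholder:

  # zpool get version tank
  # zfs get version tank
  # zpool upgrade -v

As long as I don't run zpool/zfs upgrade on the newer build, the on-disk 
versions stay where they were created.)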

Namely, I have three other nodes that are similar, but not identical. They are 
a bit older, so they have E5420 CPUs and the previous model of the 
motherboard; they also have a bit less memory and use 750GB disks. Those 
systems run Solaris 10, and no matter how much I transfer to them through Hadoop, 
nothing moves them. So this makes me believe there are a few candidates for 
what could be bad here:

1) a bug in the OpenSolaris kernel (driver or otherwise)
2) bad motherboard (hmm... I doubt it a bit)
3) bad Areca controller
4) bad disks
5) bad memory

Options 2-5 I would exclude, because the hang doesn't happen simply due to a 
large amount of transfers: I can literally transfer tens of TB to it using other 
tools (which, btw, are also written in Java, just like Hadoop). 

So this leads me to think that this might be an osol bug, but beyond what I 
have done so far I can't really think of how to continue. The system should 
have been in production for a week already, but I can't put it there, as its 
main purpose is to run the Hadoop datanode and it can't handle that for longer 
than 5 minutes at the moment. 

Any ideas are welcome; I've mostly exhausted the ones I've collected on 
#opensolaris over the past four days (you can look up the IRC logs if you want, 
I'm mario_ there). If I forgot anything, let me know and I'll add the info as 
requested. 

Things I remembered just now that I have also tried:
* Looked at the ARC size; there were multiple GB still free, and at times the 
ARC has been only ca 2.5GB while total memory is 12GB (see the snippet after 
this list). 
* Ran truss on the hadoop java process; nothing out of the ordinary found 
at the end of it. The only thing I did notice was that there were a lot of 
EAGAIN errors on sockets, which after some googling I understood come from it 
trying to read more data than the socket currently has available. 
* I've tried redirecting the console to COM2 and connecting to it with ipmitool 
sol; that worked just fine, but when it hung and I sent ~B it even told me 
that it had sent a break, and yet it didn't break me into kmdb (rough commands 
after the list). 
* The error happens 100% of the time, and there are no log messages involved. 
The normal activity just stops, and the next thing in the logs is the next boot. 
* The system does have trunking enabled on the external interfaces, but the 
transfers happen on the internal one (without trunking), and I've also tried 
without the trunk; it still hangs. 
* I've tried the NMI approach by adding set pcplusmp:apic_kmdb_on_nmi=1 to 
/etc/system and sending ipmitool chassis power diag to the system, but nothing 
happened. I tried the same ipmi command while in a shell and never saw the NMI 
message that was supposed to be displayed according to the blog where I got 
this idea, so maybe the BMC doesn't pass it on (the commands are sketched 
below)...
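
For reference, the ARC size is easy to check with the standard counters, either 
of these:

  # kstat -p zfs:0:arcstats:size
  # echo "::arc" | mdb -k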
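
And the console/NMI attempts were along these lines (BMC address and 
credentials are placeholders; /etc/system has the line from the last bullet):

  # ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> sol activate
      (then ~B inside the SOL session to send a serial break)
  # ipmitool -I lanplus -H <bmc-ip> -U <user> -P <pass> chassis power diag
      (should pulse a diagnostic interrupt / NMI at the host)
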
-- 
This message posted from opensolaris.org