I'm not 100% sure if this is the right forum, but admins, feel free to move it if it's not.
The issue, in short, is that whenever I fire up the hadoop datanode on the new osol system and start writing files to HDFS from other nodes, the system works for a few minutes writing files as it's supposed to and then hangs. And I mean really hangs. I've got access to the remote console (it's a Supermicro MoBo with BMC on a dedicated lan) and it's unresponsive; I've gone to the datacenter and verified with a keyboard and monitor that it's unresponsive.

Now a bit of background. I started with osol 111b (2009.06) and the system worked just fine while I was using dCache for the storage virtualization and FDT for transfers. I transferred over 14TB over ca 30h to the locally set up zfs (a zpool of 2x raid-z of 12 disks) and all was fine. The problems started when I installed hadoop and started to work in that environment.

The machine itself is:
- 2 x Intel E5520 2266MHz Quad-Core Xeon DP
- Supermicro X8DTL-3F
- 2 x 6GB 1333MHz KVR1333D3E9SK3/6G
- Areca ARC-1280ML, 24-port SATA-II HW RAID, PCI-E
- Areca ARC-BBM Battery Backup Module
- DIMM 2GB DDR2 PC5300 667MHz, ECC for the ARC-1280ML
- Quad port Intel ethernet (e1000g driver)

So to start debugging the odd behavior (btw, reading files from hadoop by clients is just fine, no hang there) I first booted the machine with the -k -v options, and once it hung I used F1-A to get into kmdb (boy do I consider myself lucky now in hindsight) and initiated a panic and dump. That dump I still have if anyone has interest or ideas. Once it booted up again the system reported some Intel CPU errors, as can be seen here: http://pastebin.com/m2127ae0c Going by this I had the manufacturer replace the CPUs today, but alas that wasn't the cause (or I got a second batch of faulty ones, doubtful).

While I was waiting for the new CPUs I continued to debug, watching mpstat, vmstat, iostat and so forth, but didn't notice anything really out of the ordinary happening just before the hang. During the file transfers the interrupts go up to ca 12k, system calls up to 60k and context switches up to 70k, then they fall down again and go up again as the blocks in hadoop arrive. I've seen hangs during high cs periods (mostly) and during low ones (a few times), so I didn't find a correlation there.

However, one steady thing I discovered was that I could no longer break from the hang into kmdb. No matter how many times I tried F1-A on however many hangs, I never got back into kmdb. I've been able to get into kmdb on four other occasions, but always from an idle shell, and even those were rare occurrences; in most cases F1-A doesn't seem to do anything.

So I thought maybe it's a hard hang and added

  set snooping=1
  set snoop_interval=90000000

to /etc/system. On the next hang I allowed the system some time, but nothing happened. In the first few attempts I didn't even have snoop_interval defined, so I assumed it should use the 50s default, but even letting the system hang for a full night didn't change anything.

So I am a bit at a loss on how to proceed, as obviously a user level process (running as user hadoop even, not root) should not be able to trash the system like this. I have in the meantime upgraded to snv_128a with no change in the behavior. I have not upgraded the zfs pool version, as I was considering that if I reinstall the machine with Solaris 10 it should still be able to mount the zfs pools, just to try whether that changes anything. Namely, I have three other nodes that are similar, but not identical.
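In case someone wants to dig into that panic dump with me, this is roughly how I would start poking at it with mdb (assuming savecore wrote the usual unix.0/vmcore.0 pair under /var/crash/<hostname>; the dcmds are just the obvious first things to look at, not a definitive recipe):

    $ cd /var/crash/<hostname>
    $ mdb unix.0 vmcore.0
    > ::status          # summary of the dump: panic string, OS release
    > ::msgbuf          # last console messages before the forced panic
    > ::panicinfo       # registers and the panicking thread
    > ::cpuinfo -v      # what each CPU was running at the time of the dump
    > ::threadlist -v   # kernel thread stacks, to look for anything stuck

I'm happy to run any other dcmds against it and paste the output if someone has a better idea of what to look for in a hard hang.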
They are a bit older nodes, so they have E5420 CPUs and the previous model of the motherboard; they also have a bit less memory and use 750GB disks. Those systems run Solaris 10, and no matter how much I transfer to them through hadoop, nothing moves them. So this makes me believe there are a few things that could be bad here:

1) a bug in the OpenSolaris kernel (driver or not)
2) bad motherboard (hmm... doubt it a bit)
3) bad Areca controller
4) bad disks
5) bad memory

Options 2-5 I would exclude, because the hang doesn't happen simply because of a big amount of transfers: I can literally transfer tens of TB to it using other tools (which btw are also written in Java, as is hadoop). So this leads me to think that this might be an osol bug, but beyond what I have done so far I can't really think of how to continue. The system should have been in production for a week already, but I can't put it there, as its main purpose is to run the hadoop datanode and it can't handle that for longer than 5 minutes at the moment.

Any ideas are welcome; I've mostly exhausted the ones I've collected on #opensolaris over the past four days (you can look up the IRC logs if you want, I'm mario_ there). If I forgot anything, let me know and I'll add info as it's requested.

Things I remembered now that I've already tried:

* Looked at the ARC size: there were multiple GB still free, and at times the ARC has been only ca 2.5GB while total memory is 12GB.
* Ran truss on the hadoop java process; nothing out of the ordinary found at the end of it. The only thing I did notice was a lot of EAGAIN errors on sockets, which after googling I understood come from it trying to read more data than the socket currently has available.
* I've tried redirecting the console to COM2 and connecting to it with ipmitool sol; that worked just fine, but when it hung and I sent ~B it even told me that it sent a break, yet it still didn't break me into kmdb.
* The hang happens 100% of the time and there are no log messages involved: the normal activity just stops and then there's the boot.
* The system does have trunking enabled on the external interfaces, but the transfers happen on the internal one (without trunking), and I've also tried without the trunk; it still hangs.
* I've tried the NMI approach by adding "set pcplusmp:apic_kmdb_on_nmi=1" to /etc/system and sending "ipmitool chassis power diag" to the system, but nothing happened. I tried the same ipmi command while in a shell and never saw the NMI message that was supposed to be displayed according to the blog where I got this idea, so maybe the BMC doesn't pass it on... (The exact settings and commands are in the snippet below.)
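For completeness, here are the console-redirect and NMI bits from the last two items above, roughly as I'm using them (the BMC address, user and password are placeholders, and the exact ipmitool invocation may differ for other BMCs):

    # /etc/system: let an NMI drop the kernel into kmdb (needs a reboot to take effect)
    set pcplusmp:apic_kmdb_on_nmi=1

    # attach to the console redirected to COM2 over serial-over-lan
    ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sol activate

    # from another box: ask the BMC to raise a diagnostic interrupt (NMI)
    ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> chassis power diag

If anyone knows whether this Supermicro BMC actually forwards the diagnostic interrupt to the host as an NMI, that would already tell me whether this route is worth pursuing further.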
