Hi,

At some point one node of our 5-node cluster stopped and we had to restart it. After that I restarted all the Ambari and HDP services, but Trafodion fails to start.
Below are some stack traces, plus details for the core files that give me no stack. The files are from node1 and node2, dated Oct 2 (when I think node 2 was down) and Oct 6 (when we rebooted the node and tried to start Trafodion). Feel free to connect and debug the issue on our cluster; Amanda has the credentials.

*FROM NODE1*

Oct 2 22:27 core.39347
core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00000 00000 039347 $TM0 188.138.61.175:60186 00002 00000 00009 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.39347
no stack

Oct 2 22:41 core.15144
Program terminated with signal 6, Aborted.
#0 0x00007f77bcbbb625 in ?? ()
#1 0x00007f77bcbbce05 in ?? ()
#2 0x0000000000000010 in ?? () at ../common/Collections.cpp:109
#3 0x00007f77bee62130 in ?? ()
#4 0x00007ffe8e796ec0 in ?? ()
#5 0x00007f77bdeced00 in ?? ()
#6 0x0000000000000004 in ?? () at ../common/Collections.cpp:109
#7 0x0000000001b3a310 in ?? ()
#8 0x0000000000000000 in ?? ()

Oct 2 22:41 core.39240
#0 0x00007f534d03c625 in raise () from /lib64/libc.so.6
#1 0x00007f534d03de05 in abort () from /lib64/libc.so.6
#2 0x00007f534d03574e in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f534d035810 in __assert_fail () from /lib64/libc.so.6
#4 0x000000000046e213 in CExtTmLeaderReq::performRequest (this=0x7f53340008c0) at reqtmleader.cxx:126
#5 0x000000000045a64a in CReqWorker::reqWorkerThread (this=<value optimized out>) at reqworker.cxx:79
#6 0x000000000045a86d in reqWorker (arg=0xc6f9a0) at reqworker.cxx:147
#7 0x00007f534db45a51 in start_thread () from /lib64/libpthread.so.0
#8 0x00007f534d0f29ad in clone () from /lib64/libc.so.6

Oct 2 22:41 core.15309
core.15309: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00000 00000 015309 $TM0 188.138.61.175:60186 00002 00000 00134 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.15309
no stack

*FROM NODE2*

Oct 2 22:29 core.39491
core.39491: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00001 00001 039491 $TM1 188.138.61.177:38680 00002 00001 00003 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.39491
no stack

Oct 6 15:23 core.1394
Program terminated with signal 6, Aborted.
#0 0x00007fb97acbf625 in raise () from /lib64/libc.so.6
#1 0x00007fb97acc0e05 in abort () from /lib64/libc.so.6
#2 0x000000000041d07d in CProcessContainer::CProcessContainer (this=0x2071880, nodeContainer=<value optimized out>) at process.cxx:3366
#3 0x0000000000453f5c in CNode::CNode (this=0x2071880, name=0x204c448 "euve79672", pnid=0, rank=0) at pnode.cxx:153
#4 0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized out>) at pnode.cxx:1564
#5 0x00000000004169a5 in CCluster::InitializeConfigCluster (this=0x20757b0) at cluster.cxx:2740
#6 0x0000000000417645 in CCluster::CCluster (this=0x20757b0) at cluster.cxx:567
#7 0x0000000000431e1a in CTmSync_Container::CTmSync_Container (this=0x20757b0) at tmsync.cxx:137
#8 0x0000000000407bb6 in CMonitor::CMonitor (this=0x20757b0, procTermSig=9) at monitor.cxx:323
#9 0x00000000004086ad in main (argc=2, argv=0x7fff8322e298) at monitor.cxx:1152

Oct 6 15:43 core.17626
Program terminated with signal 6, Aborted.
#0 0x00007fcf11aea625 in raise () from /lib64/libc.so.6
#1 0x00007fcf11aebe05 in abort () from /lib64/libc.so.6
#2 0x000000000041d07d in CProcessContainer::CProcessContainer (this=0x1182890, nodeContainer=<value optimized out>) at process.cxx:3366
#3 0x0000000000453f5c in CNode::CNode (this=0x1182890, name=0x115d458 "euve79672", pnid=0, rank=0) at pnode.cxx:153
#4 0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized out>) at pnode.cxx:1564
#5 0x00000000004169a5 in CCluster::InitializeConfigCluster (this=0x11867c0) at cluster.cxx:2740
#6 0x0000000000417645 in CCluster::CCluster (this=0x11867c0) at cluster.cxx:567
#7 0x0000000000431e1a in CTmSync_Container::CTmSync_Container (this=0x11867c0) at tmsync.cxx:137
#8 0x0000000000407bb6 in CMonitor::CMonitor (this=0x11867c0, procTermSig=9) at monitor.cxx:323
#9 0x00000000004086ad in main (argc=2, argv=0x7ffcaca91f68) at monitor.cxx:1152

--
And in the end, it's not the years in your life that count. It's the life in your years.
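A note on the "no stack" cores: the `file` output for each of them names `tm` as the originating command, while gdb was pointed at `tdm_udrserv`. That mismatch may be why no stack is shown, though this is only a guess. The sketch below pulls the command name out of a `file` output line and builds a non-interactive gdb invocation. The helper names `core_origin` and `gdb_backtrace_command` are hypothetical, and the path to the matching executable is left to the caller because it is not known from this report.

```python
import re
from typing import List, Optional


def core_origin(file_output: str) -> Optional[str]:
    """Return the command name that `file` reports for a core dump.

    `file` prints the (possibly truncated) command line inside
    "from '...'"; the first whitespace-separated token is taken
    as the program name. Returns None if no such field is found.
    """
    match = re.search(r"from '([^']*)'", file_output)
    if match is None:
        return None
    tokens = match.group(1).split()
    return tokens[0] if tokens else None


def gdb_backtrace_command(executable: str, core: str) -> List[str]:
    """Build a gdb command line that prints a backtrace and exits.

    `executable` must be the binary that produced the core; its
    location is an assumption the caller has to supply.
    """
    return ["gdb", "-batch", "-ex", "bt", executable, core]


if __name__ == "__main__":
    # Line copied from the report above (core.39347 on node1).
    line = (
        "core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV), "
        "SVR4-style, from 'tm SQMON1.1 00000 00000 039347 $TM0 "
        "188.138.61.175:60186 00002 00000 00009 SPAR'"
    )
    print(core_origin(line))
```

If the reported name differs from the executable passed to gdb, rerunning gdb against the matching binary would be the first thing to try.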
