Hi, Narendra,
It looks like this problem was solved after I applied your recommended settings.
For reference, the crash was at core/sql/runtimestats/SqlStats.cpp:276:
prevHeap = statsArray_[pid].processStats_->getHeap();
(gdb) p statsArray_
$1 = (GlobalStatsArray *) 0x10003290
(gdb) p pid
$2 = 65536
I am not sure why the stats array uses the pid as its index :). Anyway, my
problem is solved and I am able to go ahead now.
Thanks a lot.
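To make the question above concrete, here is a minimal C++ sketch of a table indexed directly by pid. Everything in it is hypothetical: PidIndexedStats, StatsSlot and ProcessStatsSketch are illustrative names, not the Trafodion types, and the idea that the real array has 65536 slots is only an assumption inferred from the recommended pid_max of 65535. With an unchecked index, pid 65536 would read one slot past the end, which would be consistent with the this=0x2000 value in the backtrace further down; the sketch returns nullptr instead.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the real Trafodion types; names and layout are
// assumptions made for illustration only.
struct ProcessStatsSketch {
    int heapId;
};

struct StatsSlot {
    ProcessStatsSketch* processStats = nullptr;
};

class PidIndexedStats {
public:
    // capacity is the number of slots; valid pids are 0 .. capacity - 1.
    explicit PidIndexedStats(std::size_t capacity) : slots_(capacity) {}

    // Stores the stats pointer for a pid; returns false if the pid does not
    // fit in the table.
    bool attach(long pid, ProcessStatsSketch* stats) {
        if (!inRange(pid)) return false;
        slots_[static_cast<std::size_t>(pid)].processStats = stats;
        return true;
    }

    // Returns nullptr for an out-of-range pid instead of reading past the
    // end of the array, which is what an unchecked slots_[pid] would do.
    ProcessStatsSketch* lookup(long pid) const {
        if (!inRange(pid)) return nullptr;
        return slots_[static_cast<std::size_t>(pid)].processStats;
    }

private:
    bool inRange(long pid) const {
        return pid >= 0 && static_cast<std::size_t>(pid) < slots_.size();
    }

    std::vector<StatsSlot> slots_;
};
```

Under that assumed capacity of 65536, pid 65535 is the last valid index and pid 65536 is rejected, which would match the effect of capping kernel.pid_max at 65535.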
-----Original Message-----
From: Narendra Goyal [mailto:[email protected]]
Sent: September 8, 2015 10:04
To: [email protected]
Cc: Lijian (Q)
Subject: RE: [Urgent Help] Trafodion Build Environment Problem
Hi Nieyuanyuan,
Could you please check the 'pid_max' settings:
sysctl -q kernel.pid_max
(or cat /proc/sys/kernel/pid_max)
If the value is > 64K, I would recommend you set it to 64K, like so:
sudo sysctl -w kernel.pid_max=65535
You will have to restart Trafodion and other Hadoop/HBase processes:
swstopall
ckillall
swstartall
sqstart
Just FYI, to check the list of Trafodion processes only, please run 'cstat' in
your bash shell.
Thanks,
-Narendra
-----Original Message-----
From: Nieyuanyuan [mailto:[email protected]]
Sent: Monday, September 7, 2015 6:40 PM
To: [email protected]
Cc: Lijian (Q) <[email protected]>
Subject: [Urgent Help] Trafodion Build Environment Problem
Dear Guys,
I recently downloaded Trafodion 1.1 from
https://github.com/apache/incubator-trafodion/tree/stable/1.1 and followed the
build guide at
https://wiki.trafodion.org/wiki/index.php/Building_the_Software. After solving a
lot of problems (no need to list all the details), I am able to run Trafodion on
a Hadoop sandbox environment.
But I have a serious problem: all Trafodion-related processes go down after
several minutes (I am not sure exactly how long), and only a few processes are
left:
[nieyy@redhat-72 ~]$ ps ux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
nieyy 76554 0.1 0.1 590988 139768 pts/6 Sl 19:14 0:04 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
nieyy 118833 0.7 0.3 1535452 420996 ? Sl 19:40 0:12 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_namenode -Xmx1000m -Djava.net.prefe
nieyy 119085 0.6 0.2 1572688 367388 ? Sl 19:40 0:10 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_datanode -Xmx1000m -Djava.net.prefe
nieyy 119320 0.4 0.2 1512656 340636 ? Sl 19:41 0:07 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_secondarynamenode -Xmx1000m -Djava.
nieyy 119972 1.2 0.2 1708408 378536 pts/6 Sl 19:41 0:20 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_resourcemanager -Xmx1000m -Dhadoop.
nieyy 120133 0.9 0.2 1616388 309976 ? Sl 19:41 0:16 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_nodemanager -Xmx1000m -Dhadoop.log.
nieyy 120371 0.0 0.0 9824 1772 pts/6 S 19:41 0:00 /bin/sh ./bin/mysqld_safe --defaults-file=/home/nieyy/trafodion_build/incubator-trafodion-stable-1.
nieyy 120594 0.0 0.0 452604 89908 pts/6 Sl 19:41 0:01 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/local_hadoop/mysql/bin/mysq
nieyy 120789 0.0 0.0 9692 1736 pts/6 S 19:41 0:00 bash /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/local_hadoop/hbase/bin
nieyy 120806 2.0 0.3 1809048 509164 pts/6 Sl 19:41 0:34 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill
nieyy 122554 0.0 0.0 13624 1304 pts/6 S 19:41 0:00 mpirun -disable-auto-cleanup -demux select -env SQ_IC TCP -env MPI_ERROR_LEVEL 2 -env SQ_PIDMAP 1 -
nieyy 122555 0.0 0.0 0 0 ? Zs 19:41 0:00 [hydra_pmi_proxy] <defunct>
nieyy 122556 1.0 0.0 335212 36748 ? Ssl 19:41 0:17 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/export/bin64d/monitor COLD
nieyy 122557 0.8 0.0 335212 36768 ? Ssl 19:41 0:14 /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/export/bin64d/monitor COLD
nieyy 123946 0.9 0.1 828072 223088 pts/6 Sl 19:42 0:14 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
nieyy 124044 1.0 0.1 629200 187180 pts/6 Sl 19:42 0:16 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
I then have to kill all the processes and use swstartall and sqstart to reset
the environment. However, the environment still goes down after a while, and
I need to restart again.
I found some core files under
trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/scripts, all of
them generated by mxssmp:
[nieyy@redhat-72 scripts]$ ll core*
...
-rw------- 1 nieyy nieyy 156008448 Sep 7 17:56 core.mxssmp.173357
-rw------- 1 nieyy nieyy 145518592 Sep 7 17:56 core.mxssmp.173372
-rw------- 1 nieyy nieyy 156008448 Sep 7 19:24 core.mxssmp.74146
-rw------- 1 nieyy nieyy 145518592 Sep 7 19:24 core.mxssmp.74197
I used gdb to inspect the stack:
[nieyy@redhat-72 scripts]$ gdb /home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sql/lib/linux/64bit/debug/mxssmp ./core.mxssmp.141469
...
(gdb) where
#0 0x000000000044166c in ProcessStats::getHeap (this=0x2000) at ../runtimestats/SqlStats.h:271
#1 0x000000000043990a in StatsGlobals::removeProcess (this=0x10000000, pid=65536, calledAtAdd=0) at ../runtimestats/SqlStats.cpp:276
#2 0x0000000000439e05 in StatsGlobals::checkForDeadProcesses (this=0x10000000, myPid=141469) at ../runtimestats/SqlStats.cpp:382
#3 0x00000000004440be in SsmpGlobals::work (this=0x7f062660c7e8) at ../runtimestats/ssmpipc.cpp:582
#4 0x000000000042f06a in runServer (argc=1, argv=0x7fff5b0e5a48) at ../bin/ex_ssmp_main.cpp:259
#5 0x000000000042eb12 in main (argc=1, argv=0x7fff5b0e5a48) at ../bin/ex_ssmp_main.cpp:127
I then searched on Google and found
https://bugs.launchpad.net/trafodion/+bug/1368891, which looks similar. However,
that report says the bug was fixed in v0.9, and my version is 1.1.
So, could you kindly help me solve this problem? I can't find any more
useful information via Google.
Thanks a lot.