[
https://issues.apache.org/jira/browse/TRAFODION-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arvind Narain reassigned TRAFODION-2547:
----------------------------------------
Assignee: Eason Zhang
> Daily 2.1 builds seeing leftover semaphore dev files after running
> db_uninstall.py
> ----------------------------------------------------------------------------------
>
> Key: TRAFODION-2547
> URL: https://issues.apache.org/jira/browse/TRAFODION-2547
> Project: Apache Trafodion
> Issue Type: Bug
> Components: installer
> Affects Versions: 2.1-incubating
> Environment: Release 2.1 with py installer.
> Reporter: Arvind Narain
> Assignee: Eason Zhang
>
> Noticed that after running the db_uninstall.py script (Release 2.1) we are
> always left with the device semaphore files. This is not the case when
> trafodion_uninstaller (master) is run.
> The semaphore dev files left over by a Release 2.1 daily run cause problems
> for the next daily run of the master branch.
> In the case of cdh we don't see the failures in master builds because the
> userid picked up by the Release 2.1 python installer script is the same as
> the one picked by the master installer script (506). In an HDP environment
> this may be different (1003).
> Though Steve is fixing the Jenkins jobs to clear out /dev/shm, I think we
> have two issues:
> 1. db_uninstall.py in Release 2.1 does not stop all the trafodion
> processes - this may be due to a recent check-in where ckillall is no
> longer being run. pkillall (called by ckillall) handles all the trafodion
> processes and also clears the semaphores.
> https://github.com/apache/incubator-trafodion/pull/991
> 2. monitor could be modified to create its semaphores the way rms does, or
> at least use the userid instead of the username.
> HDP job in Release 2.1:
> 2017-03-22 06:41:47 + ./python-installer/db_uninstall.py --verbose --silent
> --config-file ./Install_Config
> 2017-03-22 06:41:48 *****************************
> 2017-03-22 06:41:48 Trafodion Uninstall Start
> 2017-03-22 06:41:48 *****************************
> 2017-03-22 06:41:48
> 2017-03-22 06:41:48 ***[INFO]: Remove Trafodion on node [slave-ahw23] ...
> 2017-03-22 06:41:48 *********************************
> 2017-03-22 06:41:48 Trafodion Uninstall Completed
> 2017-03-22 06:41:48 *********************************
> 2017-03-22 06:41:48 + uninst_ret=0
> 2017-03-22 06:41:48 + sudo rm -f
> /home/jenkins/workspace/pyodbc_test-hdp/traf_run
> 2017-03-22 06:41:48 + sudo mv
> /home/jenkins/workspace/pyodbc_test-hdp/traf_run.save
> /home/jenkins/workspace/pyodbc_test-hdp/traf_run
> 2017-03-22 06:41:48 + sudo chmod -R a+rX
> /home/jenkins/workspace/pyodbc_test-hdp/traf_run
> 2017-03-22 06:41:48 + exit 0
> 2017-03-22 06:41:48 + rc=0
> 2017-03-22 06:41:48 + echo 'Checking shared mem'
> 2017-03-22 06:41:48 Checking shared mem
> 2017-03-22 06:41:48 + ls -ld /dev/shm
> 2017-03-22 06:41:48 drwxrwxrwt 2 root root 100 Mar 22 06:35 /dev/shm
> 2017-03-22 06:41:48 + ls -l /dev/shm
> 2017-03-22 06:41:48 total 12
> 2017-03-22 06:41:48 -rw-r--r-- 1 1003 509 32 Mar 22 06:34
> sem.monitor.sem.trafodion
> 2017-03-22 06:41:48 -rw------- 1 1003 509 32 Mar 22 06:35
> sem.rms.1003.268469813
> 2017-03-22 06:41:48 -rw------- 1 1003 509 32 Mar 22 06:35
> sem.rms.1003.268477888
> 2017-03-22 06:41:48 + echo ============
> 2017-03-22 06:41:48 ============
> 2017-03-22 06:41:48 + exit 0
> 2017-03-22 06:41:48 + '[' 0 -ne 0 ']'
> 2017-03-22 06:41:48 + exit 0
> CDH job in Release 2.1:
> 2017-03-22 07:38:28 + ./python-installer/db_uninstall.py --verbose --silent
> --config-file ./Install_Config
> 2017-03-22 07:38:28 *****************************
> 2017-03-22 07:38:28 Trafodion Uninstall Start
> 2017-03-22 07:38:28 *****************************
> 2017-03-22 07:38:28
> 2017-03-22 07:38:28 ***[INFO]: Remove Trafodion on node [slave-cm54] ...
> 2017-03-22 07:38:28 *********************************
> 2017-03-22 07:38:28 Trafodion Uninstall Completed
> 2017-03-22 07:38:28 *********************************
> 2017-03-22 07:38:28 + uninst_ret=0
> 2017-03-22 07:38:28 + sudo rm -f
> /home/jenkins/workspace/pyodbc_test-cdh/traf_run
> 2017-03-22 07:38:28 + sudo mv
> /home/jenkins/workspace/pyodbc_test-cdh/traf_run.save
> /home/jenkins/workspace/pyodbc_test-cdh/traf_run
> 2017-03-22 07:38:28 + sudo chmod -R a+rX
> /home/jenkins/workspace/pyodbc_test-cdh/traf_run
> 2017-03-22 07:38:28 + exit 0
> 2017-03-22 07:38:28 + rc=0
> 2017-03-22 07:38:28 + echo 'Checking shared mem'
> 2017-03-22 07:38:28 Checking shared mem
> 2017-03-22 07:38:28 + ls -ld /dev/shm
> 2017-03-22 07:38:28 drwxrwxrwt 2 root root 100 Mar 22 07:33 /dev/shm
> 2017-03-22 07:38:28 + ls -l /dev/shm
> 2017-03-22 07:38:28 total 12
> 2017-03-22 07:38:28 -rw-r--r-- 1 506 507 32 Mar 22 07:33
> sem.monitor.sem.trafodion
> 2017-03-22 07:38:28 -rw------- 1 506 507 32 Mar 22 07:33 sem.rms.506.268474535
> 2017-03-22 07:38:28 -rw------- 1 506 507 32 Mar 22 07:33 sem.rms.506.268480568
> 2017-03-22 07:38:28 + echo ============
> 2017-03-22 07:38:28 ============
> 2017-03-22 07:38:28 + exit 0
> 2017-03-22 07:38:28 + '[' 0 -ne 0 ']'
> 2017-03-22 07:38:28 + exit 0
> ===
> Previous emails:
> From: Selva Govindarajan [mailto:[email protected]]
> Sent: Tuesday, March 21, 2017 12:25 PM
> To: [email protected]
> Cc: Steve Varnau <[email protected]>
> Subject: Re: Trafodion Master daily build failures
> Thanks Arvind and Steve for following up on it. I had said RMS uses the
> port number; actually, the segment id is obtained from the foundation layer
> and used in the semaphore name.
> SEG_ID getStatsSegmentId()
> {
>   Int32 segid;
>   Int32 error;
>   if (gStatsSegmentId_ == -1)
>   {
>     error = msg_mon_get_my_segid(&segid);
>     assert(error == 0); // XZFIL_ERR_OK
>     gStatsSegmentId_ = segid + RMS_SEGMENT_ID_OFFSET;
>   }
>   return gStatsSegmentId_;
> }
> RMS gets it once and stores the created semaphore name for later use. I
> think the process id could also be used in the case of the monitor, because
> that semaphore is valid only as long as the monitor is alive. In the case
> of RMS, the semaphore name needs to remain the same even if RMS processes
> are restarted, as long as the node is up.
> Selva
> ________________________________
> From: Arvind N <[email protected]>
> Sent: Tuesday, March 21, 2017 12:03:22 PM
> To: [email protected]
> Cc: Steve Varnau; Selva Govindarajan
> Subject: RE: Trafodion Master daily build failures
> Steve modified the scripts to print out the contents of /dev/shm before
> install and after uninstall. As per the following, it does seem that there
> is a leftover semaphore in /dev/shm from the previous build.
> I did notice that the failures are restricted to the HDP environment. They
> happen when the slave system was first used by a daily build for
> Release 2.1 (which leaves files in /dev/shm for id 1003) and the same
> system is then used for a daily build of master. Maybe the logic for
> finding the next available id differs between the py installer and the
> bash installer?
> Selva's suggestion to attach the process ID to the semaphore name should
> clear this problem.
> From master daily build:
> https://jenkins.esgyn.com/job/core-regress-privs1-hdp/505/console
> AHW 2.3 (i-014c7dcfa0719ec26)
> 2017-03-21 09:18:58 === Tue Mar 21 09:18:58 UTC 2017:
> /usr/local/bin/install-traf.sh
> 2017-03-21 09:18:58 === Setting up Trafodion
> 2017-03-21 09:18:58
> ========================================================
> 2017-03-21 09:18:58 Source
> /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/core/sqf/conf/inst
> all_features
> 2017-03-21 09:18:58 Java for Trafodion install:
> /usr/lib/jvm/java-1.7.0-openjdk.x86_64
> 2017-03-21 09:18:58 Saving output in Install_Start.log
> 2017-03-21 09:18:58 + chmod o+r
> /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apach
> e-trafodion_installer-2.2.0-incubating.tar.gz
> /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apach
> e-trafodion_server-2.2.0-RH6-x86_64-incubating.tar.gz
> /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apach
> e-trafodion-regress.tgz
> 2017-03-21 09:18:58 + echo 'Checking shared mem'
> 2017-03-21 09:18:58 Checking shared mem
> 2017-03-21 09:18:58 + ls -ld /dev/shm
> 2017-03-21 09:18:58 drwxrwxrwt 2 root root 100 Mar 21 09:18
> /dev/shm
> 2017-03-21 09:18:58 + ls -l /dev/shm
> 2017-03-21 09:18:58 total 12
> 2017-03-21 09:18:58 -rw-r--r-- 1 1003 509 32 Mar 21 08:03
> sem.monitor.sem.trafodion
> 2017-03-21 09:18:58 -rw------- 1 1003 509 32 Mar 21 08:03
> sem.rms.1003.268468606
> 2017-03-21 09:18:58 -rw------- 1 1003 509 32 Mar 21 08:03
> sem.rms.1003.268490614
> 2017-03-21 09:18:58 + echo ============
> 2017-03-21 09:18:58 ============
> Leftover from the Release 2.1 build:
> https://jenkins.esgyn.com/job/phoenix_part2_T4-hdp/580/consoleFull - 2.1
> build
> 2017-03-21 09:16:05 *********************************
> 2017-03-21 09:16:05 Trafodion Uninstall Completed
> 2017-03-21 09:16:05 *********************************
> 2017-03-21 09:16:05 + uninst_ret=0
> 2017-03-21 09:16:05 + sudo rm -f
> /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
> 2017-03-21 09:16:05 + sudo mv
> /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run.save
> /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
> 2017-03-21 09:16:05 + sudo chmod -R a+rX
> /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
> 2017-03-21 09:16:05 + exit 0
> 2017-03-21 09:16:05 + rc=0
> 2017-03-21 09:16:05 + echo 'Checking shared mem'
> 2017-03-21 09:16:05 Checking shared mem
> 2017-03-21 09:16:05 + ls -ld /dev/shm
> 2017-03-21 09:16:05 drwxrwxrwt 2 root root 100 Mar 21 09:15
> /dev/shm
> 2017-03-21 09:16:05 + ls -l /dev/shm
> 2017-03-21 09:16:05 total 12
> 2017-03-21 09:16:05 -rw-r--r-- 1 1003 509 32 Mar 21 08:03
> sem.monitor.sem.trafodion
> 2017-03-21 09:16:05 -rw------- 1 1003 509 32 Mar 21 08:03
> sem.rms.1003.268468606
> 2017-03-21 09:16:05 -rw------- 1 1003 509 32 Mar 21 08:03
> sem.rms.1003.268490614
> 2017-03-21 09:16:05 + echo ============
> 2017-03-21 09:16:05 ============
> 2017-03-21 09:16:05 + exit 0
> 2017-03-21 09:16:05 + exit 0
> Regards
> Arvind
> -----Original Message-----
> From: Narendra Goyal [mailto:[email protected]]
> Sent: Friday, March 17, 2017 2:39 PM
> To: [email protected]
> Subject: RE: Trafodion Master daily build failures
> Checked the /dev/shm directory on the build machine and it was empty. I
> was able to create a file /dev/shm/foo (as the 'trafodion' user id) - so
> it does not look like a permissions issue (on /dev/shm at least).
> I am not sure whether any build has happened on that build machine, but I
> do not see any orphan semaphores in /dev/shm.
> Thanks,
> -Narendra
> -----Original Message-----
> From: Selva Govindarajan [mailto:[email protected]]
> Sent: Friday, March 17, 2017 11:07 AM
> To: [email protected]
> <mailto:[email protected]>
> Subject: Trafodion Master daily build failures
> First, I changed the subject line so that this message doesn't get filtered
> out. The Trafodion master daily build has been failing randomly with the
> following stack trace in the monitor.
> (gdb) bt
> #0 0x00007feaee0eb625 in raise () from /lib64/libc.so.6
> #1 0x00007feaee0ece05 in abort () from /lib64/libc.so.6
> #2 0x000000000041f8b3 in CProcessContainer::CProcessContainer
> (this=0x270e340, nodeContainer=<value optimized out>) at process.cxx:3389
> #3 0x00000000004569cc in CNode::CNode (this=0x270e340, name=0x26e9548
> "slave-ahw23", pnid=0, rank=0) at pnode.cxx:152
> #4 0x0000000000458050 in CNodeContainer::AddNodes (this=<value optimized
> out>) at pnode.cxx:1572
> #5 0x0000000000419185 in CCluster::InitializeConfigCluster (this=0x2712270)
> at cluster.cxx:2818
> #6 0x0000000000419e25 in CCluster::CCluster (this=0x2712270) at
> cluster.cxx:597
> #7 0x000000000043473a in CTmSync_Container::CTmSync_Container
> (this=0x2712270) at tmsync.cxx:137
> #8 0x0000000000408f36 in CMonitor::CMonitor (this=0x2712270, procTermSig=9)
> at monitor.cxx:329
> #9 0x000000000040a5ab in main (argc=2, argv=0x7ffd157c0b48) at
> monitor.cxx:1308
> (gdb) f 2
> The monitor log shows
> 2017-03-16 09:21:48,327, INFO, MON, Node Number: 0,, PIN: 17918 , Process
> Name: $MONITOR,,, TID: 17918, Message ID: 101020103, [CMonitor::main],
> monitor Version 1.0.1 prodver Release 2.2.0 (Build release
> [2.0.1rc3-1425-g6155ff1_Bld883], branch 6155ff1_no_branch, date
> 20170316_0832), Started! CommType: Sockets
> 2017-03-16 09:21:48,327, INFO, MON, Node Number: 0,, PIN: 17918 , Process
> Name: $MONITOR,,, TID: 17918, Message ID: 101010401, [CCluster::CCluster]
> Validation of node down is disabled
> 2017-03-16 09:21:48,328, ERROR, MON, Node Number: 0,, PIN: 17918 , Process
> Name: $MONITOR,,, TID: 17918, Message ID: 101030703,
> [CProcessContainer::CProcessContainer], Can't create semaphore
> /monitor.sem.trafodion! (Permission denied)
> 2017-03-16 09:21:48,328, ERROR, MON, Node Number: 0,, PIN: 17918 , Process
> Name: $MONITOR,,, TID: 17918, Message ID: 101030704,
> [CProcessContainer::CProcessContainer], Can't unlink semaphore
> /monitor.sem.trafodion! (Permission denied)
> I came up with the following theory:
> When a semaphore is created, the process creates a device file with the
> given semaphore name under /dev/shm. The process owner needs write
> permission to create this file. Initially I suspected a permission issue
> on the /dev/shm directory, but I just looked at /dev/shm in the Jenkins
> VM and it does have write permission.
> Since the directory is writable, it is possible the previous semaphore was
> not cleaned up correctly. The monitor seems to create the semaphore as
> /dev/shm/sem.monitor.<user_name>. If trafodion gets a different uid between
> two runs, it may be unable to clean it up. In the case of RMS, we attach
> the port number to the semaphore name so that every run by the same user
> name gets a different semaphore name.
> ---------------------
> The sem_open documentation shows:
> EACCES The semaphore exists, but the caller does not have permission
> to open it.
> EACCES is 13, which matches the errno returned in gdb.
> Please offer your help to resolve this issue if you have any other idea.
> Selva
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)