[ 
https://issues.apache.org/jira/browse/TRAFODION-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvind Narain reassigned TRAFODION-2547:
----------------------------------------

    Assignee: Eason Zhang

> Daily 2.1 builds seeing leftover semaphore dev files after running 
> db_uninstall.py
> ----------------------------------------------------------------------------------
>
>                 Key: TRAFODION-2547
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2547
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: installer
>    Affects Versions: 2.1-incubating
>         Environment: Release 2.1 with py installer.
>            Reporter: Arvind Narain
>            Assignee: Eason Zhang
>
> Noticed that after running the db_uninstall.py script (Release 2.1) we are 
> always left with the semaphore device files. This is not the case when 
> trafodion_uninstaller (master) is run.
> The leftover semaphore device files from the daily Release 2.1 run cause 
> problems for the next daily run of the master branch.
> In the CDH environment we don't see the failures in master builds, because 
> the userid picked up by the Release 2.1 Python installer script is the same 
> as the one used by the master installer script (506). In the HDP environment 
> the two may differ (1003).
> Though Steve is fixing the Jenkins jobs to clear out /dev/shm, I think we 
> have two issues:
> 1. db_uninstall.py in Release 2.1 does not stop all the trafodion 
> processes - this may be due to a recent check-in where ckillall is no longer 
> being done. pkillall (called by ckillall) handles all the trafodion 
> processes and also clears the semaphores.
> https://github.com/apache/incubator-trafodion/pull/991
> 2. The monitor could be modified to create semaphores the way RMS does, or 
> at least use the userid instead of the username.
> HDP  job in Release 2.1:
> 2017-03-22 06:41:47 + ./python-installer/db_uninstall.py --verbose --silent 
> --config-file ./Install_Config
> 2017-03-22 06:41:48 *****************************
> 2017-03-22 06:41:48   Trafodion Uninstall Start
> 2017-03-22 06:41:48 *****************************
> 2017-03-22 06:41:48 
> 2017-03-22 06:41:48 ***[INFO]: Remove Trafodion on node [slave-ahw23] ...
> 2017-03-22 06:41:48 *********************************
> 2017-03-22 06:41:48   Trafodion Uninstall Completed
> 2017-03-22 06:41:48 *********************************
> 2017-03-22 06:41:48 + uninst_ret=0
> 2017-03-22 06:41:48 + sudo rm -f 
> /home/jenkins/workspace/pyodbc_test-hdp/traf_run
> 2017-03-22 06:41:48 + sudo mv 
> /home/jenkins/workspace/pyodbc_test-hdp/traf_run.save 
> /home/jenkins/workspace/pyodbc_test-hdp/traf_run
> 2017-03-22 06:41:48 + sudo chmod -R a+rX 
> /home/jenkins/workspace/pyodbc_test-hdp/traf_run
> 2017-03-22 06:41:48 + exit 0
> 2017-03-22 06:41:48 + rc=0
> 2017-03-22 06:41:48 + echo 'Checking shared mem'
> 2017-03-22 06:41:48 Checking shared mem
> 2017-03-22 06:41:48 + ls -ld /dev/shm
> 2017-03-22 06:41:48 drwxrwxrwt 2 root root 100 Mar 22 06:35 /dev/shm
> 2017-03-22 06:41:48 + ls -l /dev/shm
> 2017-03-22 06:41:48 total 12
> 2017-03-22 06:41:48 -rw-r--r-- 1 1003 509 32 Mar 22 06:34 
> sem.monitor.sem.trafodion
> 2017-03-22 06:41:48 -rw------- 1 1003 509 32 Mar 22 06:35 
> sem.rms.1003.268469813
> 2017-03-22 06:41:48 -rw------- 1 1003 509 32 Mar 22 06:35 
> sem.rms.1003.268477888
> 2017-03-22 06:41:48 + echo ============
> 2017-03-22 06:41:48 ============
> 2017-03-22 06:41:48 + exit 0
> 2017-03-22 06:41:48 + '[' 0 -ne 0 ']'
> 2017-03-22 06:41:48 + exit 0
> CDH job in Release 2.1:
> 2017-03-22 07:38:28 + ./python-installer/db_uninstall.py --verbose --silent 
> --config-file ./Install_Config
> 2017-03-22 07:38:28 *****************************
> 2017-03-22 07:38:28   Trafodion Uninstall Start
> 2017-03-22 07:38:28 *****************************
> 2017-03-22 07:38:28 
> 2017-03-22 07:38:28 ***[INFO]: Remove Trafodion on node [slave-cm54] ...
> 2017-03-22 07:38:28 *********************************
> 2017-03-22 07:38:28   Trafodion Uninstall Completed
> 2017-03-22 07:38:28 *********************************
> 2017-03-22 07:38:28 + uninst_ret=0
> 2017-03-22 07:38:28 + sudo rm -f 
> /home/jenkins/workspace/pyodbc_test-cdh/traf_run
> 2017-03-22 07:38:28 + sudo mv 
> /home/jenkins/workspace/pyodbc_test-cdh/traf_run.save 
> /home/jenkins/workspace/pyodbc_test-cdh/traf_run
> 2017-03-22 07:38:28 + sudo chmod -R a+rX 
> /home/jenkins/workspace/pyodbc_test-cdh/traf_run
> 2017-03-22 07:38:28 + exit 0
> 2017-03-22 07:38:28 + rc=0
> 2017-03-22 07:38:28 + echo 'Checking shared mem'
> 2017-03-22 07:38:28 Checking shared mem
> 2017-03-22 07:38:28 + ls -ld /dev/shm
> 2017-03-22 07:38:28 drwxrwxrwt 2 root root 100 Mar 22 07:33 /dev/shm
> 2017-03-22 07:38:28 + ls -l /dev/shm
> 2017-03-22 07:38:28 total 12
> 2017-03-22 07:38:28 -rw-r--r-- 1 506 507 32 Mar 22 07:33 
> sem.monitor.sem.trafodion
> 2017-03-22 07:38:28 -rw------- 1 506 507 32 Mar 22 07:33 sem.rms.506.268474535
> 2017-03-22 07:38:28 -rw------- 1 506 507 32 Mar 22 07:33 sem.rms.506.268480568
> 2017-03-22 07:38:28 + echo ============
> 2017-03-22 07:38:28 ============
> 2017-03-22 07:38:28 + exit 0
> 2017-03-22 07:38:28 + '[' 0 -ne 0 ']'
> 2017-03-22 07:38:28 + exit 0
> ===
> Previous emails:
> From: Selva Govindarajan [mailto:[email protected]] 
> Sent: Tuesday, March 21, 2017 12:25 PM
> To: [email protected]
> Cc: Steve Varnau <[email protected]>
> Subject: Re: Trafodion Master daily build failures
> Thanks Arvind and Steve for following this up. I had said RMS uses the port 
> number. Actually, the segment id is obtained from the foundation layer and 
> used in the semaphore name.
> SEG_ID getStatsSegmentId()
> {
>   Int32 segid;
>   Int32 error;
>   if (gStatsSegmentId_ == -1)
>   {
>     error = msg_mon_get_my_segid(&segid);
>     assert(error == 0); // XZFIL_ERR_OK
>     gStatsSegmentId_ = segid + RMS_SEGMENT_ID_OFFSET;
>   }
>   return gStatsSegmentId_;
> }
> RMS gets it once and stores the created semaphore name for later use. I 
> think the process id could also be used in the case of the monitor, because 
> the semaphore is valid only as long as the monitor is alive. In the case of 
> RMS, the semaphore name needs to remain the same even when RMS processes are 
> restarted, as long as the node is UP.
> Selva
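Selva's point comes down to how the /dev/shm filename is derived: a name keyed by uid and segment id stays stable across RMS restarts, while a name keyed by pid is unique to each monitor incarnation and so can never collide with a stale file. A small sketch of the two schemes (the helper names are mine, and the 0x10000000 value for RMS_SEGMENT_ID_OFFSET is an assumption consistent with the sem.rms.1003.2684xxxxx names in the logs, not a value confirmed by the source):

```python
def rms_sem_file(uid, segid, offset=0x10000000):
    """RMS-style name: stable across RMS restarts while the node is up,
    because it depends only on the uid and the foundation-layer segment
    id (plus RMS_SEGMENT_ID_OFFSET, value assumed here)."""
    return "sem.rms.%d.%d" % (uid, segid + offset)

def monitor_sem_file_pid(pid):
    """Suggested monitor name: keyed by pid, so a fresh monitor never
    collides with a leftover file from a previous run, regardless of
    which uid the installer happened to pick."""
    return "sem.monitor.%d" % pid
```

With the pid scheme, the Release 2.1 leftover and the master-branch monitor would use different /dev/shm entries even when the installers choose different userids.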
> ________________________________
> From: Arvind N <[email protected]>
> Sent: Tuesday, March 21, 2017 12:03:22 PM
> To: [email protected]
> Cc: Steve Varnau; Selva Govindarajan
> Subject: RE: Trafodion Master daily build failures
> Steve modified the scripts to print out the contents of /dev/shm before
> install and after uninstall. As per the following, it does seem that it is a
> leftover semaphore in /dev/shm from the previous build.
> I did notice that the failures are restricted to the HDP environment. It
> happens where the slave system was first used by a daily build for
> Release 2.1 (which leaves files in /dev/shm for id 1003) and the same system
> is then used for a daily build of master. Maybe the logic of finding the
> next available id is different in the py installer vs the bash installer?
> Selva's suggestion to attach the process ID to the semaphore name should
> clear this problem.
>                 From master daily build:
> https://jenkins.esgyn.com/job/core-regress-privs1-hdp/505/console
>                  AHW 2.3 (i-014c7dcfa0719ec26)
>                 2017-03-21 09:18:58 === Tue Mar 21 09:18:58 UTC 2017:
> /usr/local/bin/install-traf.sh
>                 2017-03-21 09:18:58 === Setting up Trafodion
>                 2017-03-21 09:18:58
> ========================================================
>                 2017-03-21 09:18:58 Source
> /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/core/sqf/conf/inst
> all_features
>                 2017-03-21 09:18:58 Java for Trafodion install:
> /usr/lib/jvm/java-1.7.0-openjdk.x86_64
>                 2017-03-21 09:18:58 Saving output in Install_Start.log
>                 2017-03-21 09:18:58 + chmod o+r
> /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apach
> e-trafodion_installer-2.2.0-incubating.tar.gz
> /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apach
> e-trafodion_server-2.2.0-RH6-x86_64-incubating.tar.gz
> /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apach
> e-trafodion-regress.tgz
>                 2017-03-21 09:18:58 + echo 'Checking shared mem'
>                 2017-03-21 09:18:58 Checking shared mem
>                 2017-03-21 09:18:58 + ls -ld /dev/shm
>                 2017-03-21 09:18:58 drwxrwxrwt 2 root root 100 Mar 21 09:18
> /dev/shm
>                 2017-03-21 09:18:58 + ls -l /dev/shm
>                 2017-03-21 09:18:58 total 12
>                 2017-03-21 09:18:58 -rw-r--r-- 1 1003 509 32 Mar 21 08:03
> sem.monitor.sem.trafodion
>                 2017-03-21 09:18:58 -rw------- 1 1003 509 32 Mar 21 08:03
> sem.rms.1003.268468606
>                 2017-03-21 09:18:58 -rw------- 1 1003 509 32 Mar 21 08:03
> sem.rms.1003.268490614
>                 2017-03-21 09:18:58 + echo ============
>                 2017-03-21 09:18:58 ============
>                 Leftover from the Release 2.1 build:
> https://jenkins.esgyn.com/job/phoenix_part2_T4-hdp/580/consoleFull - 2.1
> build
>                 2017-03-21 09:16:05 *********************************
>                 2017-03-21 09:16:05   Trafodion Uninstall Completed
>                 2017-03-21 09:16:05 *********************************
>                 2017-03-21 09:16:05 + uninst_ret=0
>                 2017-03-21 09:16:05 + sudo rm -f
> /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
>                 2017-03-21 09:16:05 + sudo mv
> /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run.save
> /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
>                 2017-03-21 09:16:05 + sudo chmod -R a+rX
> /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
>                 2017-03-21 09:16:05 + exit 0
>                 2017-03-21 09:16:05 + rc=0
>                 2017-03-21 09:16:05 + echo 'Checking shared mem'
>                 2017-03-21 09:16:05 Checking shared mem
>                 2017-03-21 09:16:05 + ls -ld /dev/shm
>                 2017-03-21 09:16:05 drwxrwxrwt 2 root root 100 Mar 21 09:15
> /dev/shm
>                 2017-03-21 09:16:05 + ls -l /dev/shm
>                 2017-03-21 09:16:05 total 12
>                 2017-03-21 09:16:05 -rw-r--r-- 1 1003 509 32 Mar 21 08:03
> sem.monitor.sem.trafodion
>                 2017-03-21 09:16:05 -rw------- 1 1003 509 32 Mar 21 08:03
> sem.rms.1003.268468606
>                 2017-03-21 09:16:05 -rw------- 1 1003 509 32 Mar 21 08:03
> sem.rms.1003.268490614
>                 2017-03-21 09:16:05 + echo ============
>                 2017-03-21 09:16:05 ============
>                 2017-03-21 09:16:05 + exit 0
>                 2017-03-21 09:16:05 + exit 0
> Regards
> Arvind
> -----Original Message-----
> From: Narendra Goyal [mailto:[email protected]]
> Sent: Friday, March 17, 2017 2:39 PM
> To: [email protected]
> Subject: RE: Trafodion Master daily build failures
> I checked the /dev/shm directory on the build machine and it was empty. I
> was able to create a file /dev/shm/foo (as the 'trafodion' user id) - so it
> does not look like a permissions issue (on /dev/shm at least).
> I am not sure whether any build has happened on that build machine, but I do
> not see any orphan semaphore in /dev/shm.
> Thanks,
> -Narendra
> -----Original Message-----
> From: Selva Govindarajan [mailto:[email protected]]
> Sent: Friday, March 17, 2017 11:07 AM
> To: [email protected]
> <mailto:[email protected]>
> Subject: Trafodion Master daily build failures
> First, I changed the subject line so that this message doesn't get filtered
> out. The Trafodion master daily build has been failing randomly with the
> following stack trace in the monitor.
> (gdb) bt
> #0  0x00007feaee0eb625 in raise () from /lib64/libc.so.6
> #1  0x00007feaee0ece05 in abort () from /lib64/libc.so.6
> #2  0x000000000041f8b3 in CProcessContainer::CProcessContainer
> (this=0x270e340, nodeContainer=<value optimized out>) at process.cxx:3389
> #3  0x00000000004569cc in CNode::CNode (this=0x270e340, name=0x26e9548
> "slave-ahw23", pnid=0, rank=0) at pnode.cxx:152
> #4  0x0000000000458050 in CNodeContainer::AddNodes (this=<value optimized
> out>) at pnode.cxx:1572
> #5  0x0000000000419185 in CCluster::InitializeConfigCluster (this=0x2712270)
> at cluster.cxx:2818
> #6  0x0000000000419e25 in CCluster::CCluster (this=0x2712270) at
> cluster.cxx:597
> #7  0x000000000043473a in CTmSync_Container::CTmSync_Container
> (this=0x2712270) at tmsync.cxx:137
> #8  0x0000000000408f36 in CMonitor::CMonitor (this=0x2712270, procTermSig=9)
> at monitor.cxx:329
> #9  0x000000000040a5ab in main (argc=2, argv=0x7ffd157c0b48) at
> monitor.cxx:1308
> (gdb) f 2
> The monitor log shows
> 2017-03-16 09:21:48,327, INFO, MON, Node Number: 0,, PIN: 17918 , Process
> Name: $MONITOR,,, TID: 17918, Message ID: 101020103, [CMonitor::main],
> monitor Version 1.0.1 prodver Release 2.2.0 (Build release
> [2.0.1rc3-1425-g6155ff1_Bld883], branch 6155ff1_no_branch, date
> 20170316_0832), Started! CommType: Sockets
> 2017-03-16 09:21:48,327, INFO, MON, Node Number: 0,, PIN: 17918 , Process
> Name: $MONITOR,,, TID: 17918, Message ID: 101010401, [CCluster::CCluster]
> Validation of node down is disabled
> 2017-03-16 09:21:48,328, ERROR, MON, Node Number: 0,, PIN: 17918 , Process
> Name: $MONITOR,,, TID: 17918, Message ID: 101030703,
> [CProcessContainer::CProcessContainer], Can't create semaphore
> /monitor.sem.trafodion! (Permission denied)
> 2017-03-16 09:21:48,328, ERROR, MON, Node Number: 0,, PIN: 17918 , Process
> Name: $MONITOR,,, TID: 17918, Message ID: 101030704,
> [CProcessContainer::CProcessContainer], Can't unlink semaphore
> /monitor.sem.trafodion! (Permission denied)
> I came up with the following theory:
> When a semaphore is created, a device file with the given semaphore name is
> created at /dev/shm by the process. The process owner needs write permission
> to create this file. Initially I suspected a permission issue on the
> /dev/shm directory.
> I just looked at /dev/shm in the Jenkins VM. It did have write permission.
> If that's the case, it is possible that the previous semaphore was not
> cleaned up correctly. The monitor seems to create the semaphore as
> /dev/shm/sem.monitor.<user_name>. If trafodion gets a different uid between
> two runs, it is possible that it is unable to clean it up.
> In the case of RMS, we attach the port number to the semaphore name so that
> every run from the same user name gets a different semaphore name.
> ---------------------
> The sem_open documentation shows:
> EACCES The semaphore exists, but the caller does not have permission
>               to open it.
> EACCES is 13, the errno returned in gdb.
> Please offer your help in resolving this issue if you have any other ideas.
> Selva
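The errno reasoning above can be sanity-checked directly: on Linux, EACCES is 13, matching the value observed in gdb, and sem_open returns it when the caller lacks permission on an existing semaphore - here, a /dev/shm sem file left behind by a different uid. A quick check in Python:

```python
import errno
import os

# EACCES ("Permission denied") is the errno the monitor hit when it
# tried to reuse/unlink the stale /dev/shm/sem.monitor.* file owned
# by another uid.
assert errno.EACCES == 13
assert os.strerror(errno.EACCES) == "Permission denied"
print(errno.errorcode[13])  # -> EACCES
```

This supports the leftover-semaphore theory over a directory-permission problem: /dev/shm itself was writable, but the existing sem file was not.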



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
