Hi,

Here is a problem in hugeshmctl01 and a patch to fix it. Please review.
Thanks.


Problem description (Shen, Lin Feng: [EMAIL PROTECTED]):

I am testing hugetlb with ltp-full-20080430. The cases under
${LTPROOT}/testcases/kernel/mem/hugetlb/ are executed one by one, over and
over. The run is fine for the first few hundred loops, but once
hugeshmctl01 fails for the first time, several other cases start failing
frequently as well.

---------------- Here is the staf status -----------------
clashlp1:/proc/sys/kernel # gss
Hostname          : clashlp1
Kernel            : 2.6.16.60-0.17-ppc64
Kernel Build Date : Tue Apr 22 07:28:35 UTC 2008
Distribution      : SUSE
     --------
     Job ID        : 1
     Focus Group   : BASE
     XML File Name : /usr/local/staf/xml/clashlp1.base.xml
     Function      : Test
     Arguments     : null
     Start Date    : 20080502
     Start Time    : 14:32:06
     Clear Logs         : Disabled
     Log TC Elapsed Time: Disabled
     Log TC Num Starts  : Disabled
     Log TC Start/Stop  : Disabled

          BASE Start Time:  Fri May 2 14:32:06 CDT 2008
          Snapshot Time: Sun May  4 03:48:38 CDT 2008
          --------
          hugemmap01 (0)-local;944;7858;8802
          hugemmap02 (0)-local;8802;0;8802
          hugemmap03 (0)-local;8801;0;8802
          hugemmap04 (0)-local;908;7893;8801
          hugeshmat01 (0)-local;945;7857;8802
          hugeshmat02 (0)-local;909;7893;8802
          hugeshmat03 (0)-local;945;7857;8802
          hugeshmctl01 (0)-local;943;7859;8802
          hugeshmctl02 (0)-local;908;7894;8802
          hugeshmctl03 (0)-local;944;7858;8802
          hugeshmdt01 (0)-local;944;7858;8802
          hugeshmget01 (0)-local;945;7857;8802
          hugeshmget02 (0)-local;8802;0;8802
          hugeshmget03 (0)-local;8802;0;8802
          hugeshmget05 (0)-local;945;7857;8802
                               --pass--fail--unused

---------------- Here is the ltp log ----------------
The first failure is hugeshmctl01.

hugeshmctl01    1  FAIL  :  # of attaches is incorrect - 3
hugeshmctl01    2  PASS  :  pid, size, # of attaches and mode are correct - pass #2
hugeshmctl01    3  PASS  :  new mode and change time are correct
hugeshmctl01    4  PASS  :  shared memory appears to be removed

------- Here is the meminfo -------
before hugeshmctl01 fails:

clashlp1:~ # cat /proc/meminfo | tail -4
HugePages_Total:    32
HugePages_Free:     32
HugePages_Rsvd:      0
Hugepagesize:    16384 kB
clashlp1:~ #

after hugeshmctl01 fails:

clashlp1:~ # cat /proc/meminfo | tail -4
HugePages_Total:    32
HugePages_Free:     30
HugePages_Rsvd:     30
Hugepagesize:    16384 kB
clashlp1:~ #
-------------------------------------

It seems that hugeshmctl01 doesn't free some hugetlb pages when it fails.
ps shows that an instance of hugeshmctl01 is still left behind even though
the test is no longer running, and it may still have some hugetlb pages
attached.
-------------------------------------
clashlp1:~ # ps ax  | grep huge
14166 pts/23   S+     0:00 grep huge
29360 ?        S      0:00 hugeshmctl01
clashlp1:~ #
-------------------------------------

The problem is caused by the arbitrary usleep time in hugeshmctl01, which
can result in an incorrect execution order. The sleep is meant to ensure
that the children call shmat() and pause() before the parent checks the shm
status and calls stat_cleanup(). But there is no guarantee that this sleep
is always long enough.
------------
        /* sleep briefly to ensure correct execution order */
        usleep(250000);
------------

In the failure above, the last child forked by the parent may not have run
and called shmat() right after it was created. When the parent checks the
shm status, it finds only 3 children attached to the shm instead of 4, so
it reports the failure. It then calls stat_cleanup() to send SIGUSR1 to all
children, but since the last child hasn't called pause() yet, SIGUSR1 is
handled before pause(). When the last child finally calls pause(), no
further signal arrives to wake it up, so it sleeps forever.
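
The lost wake-up can be illustrated in a single process. The sketch below
is not taken from hugeshmctl01: early_signal_is_lost(), the handler names
and the 1 second alarm() watchdog are made up for this example, and raise()
stands in for the parent's kill(). It delivers SIGUSR1 before pause() is
reached and shows that pause() is then only woken by the watchdog.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t seen_usr1, seen_alrm;

static void note_usr1(int sig) { (void)sig; seen_usr1 = 1; }
static void note_alrm(int sig) { (void)sig; seen_alrm = 1; }

/*
 * Hypothetical demo: returns 1 if SIGUSR1 was already handled before
 * pause() and pause() was woken only by the SIGALRM watchdog,
 * 0 otherwise, -1 on setup error.
 */
int early_signal_is_lost(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = note_usr1;
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;
    sa.sa_handler = note_alrm;
    if (sigaction(SIGALRM, &sa, NULL) == -1)
        return -1;

    seen_usr1 = 0;
    seen_alrm = 0;

    /* Stands in for the parent's SIGUSR1 arriving too early. */
    raise(SIGUSR1);
    int handled_before_pause = seen_usr1;

    /* Without this watchdog the pause() below would never return. */
    alarm(1);
    pause();

    return handled_before_pause && seen_alrm;
}
```

Calling early_signal_is_lost() from a small main() should return 1 after
about one second: SIGUSR1 was consumed before pause(), so only the alarm
ends the wait. In the test there is no such alarm, which is why the child
stays around.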



Patch:

The patch does not change the arbitrary usleep time, since any value is
arbitrary, although a larger one is more tolerable. Instead, each child
uses sigprocmask() to block SIGUSR1 before it waits for SIGUSR1 from the
parent, and then calls sigsuspend() to unblock SIGUSR1 and sleep until it
arrives. This should avoid the infinite sleep, so a child no longer keeps
the shm attached forever and affects the other hugetlb tests.
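
A minimal sketch of the sigprocmask()/sigsuspend() pattern follows. It is
not the attached patch: wait_with_sigsuspend(), wake_handler() and the 5
second watchdog are hypothetical, and blocking SIGUSR1 before fork() (so
the child inherits the mask) is an assumption made to keep the example
short. The parent signals the child immediately, which is the worst case
from the failure above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t woken;

static void wake_handler(int sig) { (void)sig; woken = 1; }

/*
 * Hypothetical demo: returns 0 if the child was woken by SIGUSR1 even
 * though the signal was sent right after fork(), -1 on error.
 */
int wait_with_sigsuspend(void)
{
    struct sigaction sa;
    sigset_t block, old, waitmask;
    int status;
    pid_t pid;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = wake_handler;
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    /* Block SIGUSR1 so an early signal stays pending instead of lost. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    if (pid == 0) {
        alarm(5);   /* watchdog: kills the child if the wake-up is lost */
        /* In the real test the child would call shmat() here. */
        waitmask = old;
        sigdelset(&waitmask, SIGUSR1);
        /* Atomically unblock SIGUSR1 and sleep until it is handled. */
        sigsuspend(&waitmask);
        _exit(woken ? 0 : 1);
    }

    /* Worst case: signal the child before it can reach sigsuspend(). */
    kill(pid, SIGUSR1);
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR) {
            sigprocmask(SIG_SETMASK, &old, NULL);
            return -1;
        }
    }
    sigprocmask(SIG_SETMASK, &old, NULL);

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because SIGUSR1 is blocked until sigsuspend() swaps the mask, a signal sent
too early is kept pending and delivered inside sigsuspend(), so the child
always wakes up and exits.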

In the parent process, another sigprocmask() is called before usleep().
This has the same effect as sleeping a little longer.

Attachment: fix_hugeshmctl01_children_pause_forever.patch
Description: Binary data

_______________________________________________
Ltp-list mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ltp-list
