On 2010-05-12 22:56, Brian Cameron wrote:
Robin:
Just to report what I did last night. I did a relogin with truss on
gnome-session-real (since the /usr/bin/gnome-session was renamed to
gnome-session-real). The system froze and was stuck as I described in my
previous post. After waiting for about 2 hours I interrupted my attempt.
The truss file reached 20MB when I interrupted the operation. I will
retry this once I get confirmation that the change I did in
"/usr/bin/gnome-session" is correct and how long it should take.
Yes the truss command should have been:
#/bin/bash
truss -faldo /tmp/truss.out.$$ gnome-session-real --debug
Sorry for providing the incorrect command in my last email.
Using truss on gnome-session like this should slow things down a bit,
but shouldn't hang the system for hours. It should make login take
a few minutes longer at most. However, if the bug you are experiencing
is caused by a process spinning and making a lot of system calls, truss
could get bogged down and hang.
The truss command is just a tool to see what system calls a process
makes, and is one way to analyze what a program is doing. If using
truss on gnome-session isn't working well, then we may need to diagnose
the problem using other methods.
Really what we are trying to pin down is specifically what process is
causing the problem. This problem would be easier to debug if we can
diagnose which process started by gnome-session is causing the hang.
If you can telnet into the machine from a second machine and run the
"top" command, and then log in and watch the top output when the hang
happens. I would keep an eye out for which processes are consuming the
most CPU and for programs whose memory size is growing at a fast rate.
If we could track down the problem to a specific program, then it would
likely be easier to diagnose what is going wrong.
Here's the output of the "top" command while it's frozen at the login
splash:
load averages: 0.15, 0.07, 0.03; up
3+17:33:26 12:30:45
79 processes: 78 sleeping, 1 on cpu
CPU states: 99.6% idle, 0.0% user, 0.4% kernel, 0.0% iowait, 0.0% swap
Kernel: 351 ctxsw, 675 intr, 97 syscall
Memory: 4094M phys mem, 314M free mem, 2047M total swap, 2047M free swap
PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1540 Robin 1 59 0 4028K 2248K cpu/2 0:00 0.02% top
1605 gdm 2 59 0 157M 41M sleep 0:00 0.01%
gnome-settings-
683 root 1 59 0 39M 28M sleep 1:03 0.01% spamd
566 root 18 59 0 16M 9420K sleep 4:13 0.00% smbd
567 root 3 59 0 7568K 6188K sleep 0:07 0.00% hald
669 root 1 59 0 3460K 2024K sleep 0:05 0.00%
hald-addon-acpi
617 root 1 59 0 11M 4020K sleep 0:10 0.00% httpd
1351 Robin 1 59 0 16M 5680K sleep 0:00 0.00% sshd
1614 gdm 1 59 0 83M 16M sleep 0:00 0.00%
gnome-power-man
69 root 1 59 0 8272K 1776K sleep 0:05 0.00% dhcpagent
33 root 7 59 0 9352K 2544K sleep 0:03 0.00% nwamd
516 daemon 5 59 0 11M 3296K sleep 0:03 0.00% idmapd
198 root 6 59 0 16M 5700K sleep 0:05 0.00% devfsadm
590 noaccess 1 59 0 2568K 1528K sleep 0:03 0.00% mdnsd
721 root 1 59 0 12M 5860K sleep 0:37 0.00% intrd
11 root 22 59 0 14M 12M sleep 0:19 0.00% svc.configd
218 root 31 59 0 13M 4704K sleep 0:10 0.00% nscd
375 root 1 59 0 8048K 1672K sleep 0:06 0.00% in.routed
648 daemon 4 59 0 11M 2768K sleep 0:05 0.00% nfsmapid
9 root 14 59 0 18M 11M sleep 0:04 0.00% svc.startd
496 root 4 59 0 10M 3600K sleep 0:03 0.00% inetd
633 root 2 59 0 74M 68M sleep 0:02 0.00% clamd
1547 root 3 59 0 63M 45M sleep 0:02 0.00% Xorg
571 root 19 59 0 28M 13M sleep 0:02 0.00% fmd
769 root 1 59 0 8212K 3348K sleep 0:01 0.00% master
650 daemon 2 60 -20 2752K 1840K sleep 0:00 0.00% lockd
659 daemon 2 60 -20 2760K 1756K sleep 0:00 0.00% nfsd
1613 gdm 1 59 0 145M 29M sleep 0:00 0.00%
gdm-simple-gree
1612 gdm 1 59 0 77M 12M sleep 0:00 0.00% metacity
1587 gdm 2 59 0 20M 10M sleep 0:00 0.00%
gnome-session
---------------
And here's what it looks like right after it stops being frozen:
PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1613 gdm 1 59 0 145M 30M sleep 0:00 0.03%
gdm-simple-gree
1547 root 3 59 0 64M 46M sleep 0:02 0.02% Xorg
1540 Robin 1 59 0 4028K 2248K cpu/3 0:00 0.02% top
683 root 1 59 0 39M 28M sleep 1:03 0.01% spamd
1605 gdm 2 59 0 157M 41M sleep 0:00 0.01%
gnome-settings-
566 root 18 59 0 16M 9420K sleep 4:13 0.00% smbd
567 root 3 59 0 7568K 6188K sleep 0:07 0.00% hald
669 root 1 59 0 3460K 2024K sleep 0:05 0.00%
hald-addon-acpi
1351 Robin 1 59 0 16M 5680K sleep 0:00 0.00% sshd
617 root 1 59 0 11M 4020K sleep 0:10 0.00% httpd
1614 gdm 1 59 0 83M 16M sleep 0:00 0.00%
gnome-power-man
33 root 7 59 0 9352K 2544K sleep 0:03 0.00% nwamd
69 root 1 59 0 8272K 1776K sleep 0:05 0.00% dhcpagent
516 daemon 5 59 0 11M 3296K sleep 0:03 0.00% idmapd
198 root 6 59 0 16M 5700K sleep 0:05 0.00% devfsadm
590 noaccess 1 59 0 2568K 1528K sleep 0:03 0.00% mdnsd
721 root 1 59 0 12M 5860K sleep 0:37 0.00% intrd
11 root 22 59 0 14M 12M sleep 0:19 0.00% svc.configd
218 root 31 59 0 13M 4704K sleep 0:10 0.00% nscd
375 root 1 59 0 8048K 1672K sleep 0:06 0.00% in.routed
648 daemon 4 59 0 11M 2768K sleep 0:05 0.00% nfsmapid
9 root 14 59 0 18M 11M sleep 0:04 0.00% svc.startd
496 root 4 59 0 10M 3600K sleep 0:03 0.00% inetd
633 root 2 59 0 74M 68M sleep 0:02 0.00% clamd
571 root 19 59 0 28M 13M sleep 0:02 0.00% fmd
769 root 1 59 0 8212K 3348K sleep 0:01 0.00% master
650 daemon 2 60 -20 2752K 1840K sleep 0:00 0.00% lockd
659 daemon 2 60 -20 2760K 1756K sleep 0:00 0.00% nfsd
1612 gdm 1 59 0 77M 12M sleep 0:00 0.00% metacity
1587 gdm 2 59 0 20M 10M sleep 0:00 0.00%
gnome-session
For example, in bug #14857 it seems that GConf may be responsible for
a similar problem. However, this is just speculation at this point.
But if the problem is with /usr/lib/gconfd-2, then perhaps running
truss on just the gconfd-2 process might give better output and work
better than trying to run truss on gnome-session.
I also tried without my correction. I rebooted the system and all I saw
was a black screen with a busy pointer. When logging in I saw that the
truss files in the /tmp/ dir were literaly exploding in size. Initially
they were 600MB in size and reached over a gig within a few minutes so I
stopped this before things would go haywire.
Yes, telling the gnome-session script to call itself would result in
the system just hanging. Again, apologies for being confusing with
my previous email.
I would also like to know how long time this trussing is supposed to
take. I understand that patience is a virtue but it's too much for me to
babysit the computer for 12 hours waiting for the login screen to
appear.
It shouldn't take more than a few minutes, really. If it is taking more
than 10 minutes, then a different approach to diagnosing the problem is
necessary.
The dtrace program is another program that might be useful in debugging
this sort of issue. I'm not much of a dtrace expert, but perhaps
someone else on this list could recommend ways to debug this sort of
issue with dtrace.
Brian
On 2010-05-11 05:11, Brian Cameron wrote:
Robin:
On 05/ 8/10 02:14 PM, Robin Axelsson wrote:
The delay is 3 minutes long (~180 seconds) and not 2 or 30 seconds.
This
is delay easy to measure since there's a clock in the bottom right
corner of the login splash that is frozen which "jumps" 3 minutes
forward when the input text field for the user login pops up. I
supply a
picture of what the login soplash screen looks like when it's
frozen and
another /var/adm/messages logfile that is linked to this image and
contains more "before and after" information.
Looking at the messages20100508-1959.log.txt file that you provided, I
notice the following:
- The GDM service started at 19:59:27
- At 19:59:33 it looks like the slave daemon informs the greeter to
prompt for the username
- At 20:06:14 it looks like the prompt was answered with a username.
This looks like a 6 minute delay, much longer than 3 minutes, but
perhaps it took you some additional time to actually enter a username?
Unfortunately, the syslog doesn't seem to contain any information
between 19:59:33 and 20:06:14 to indicate what might be causing this
hang.
Later in the log, I also notice that GDM service started at 20:08:21
and that the greeter was informed to prompt for a username at 20:08:31,
but that is where the log ends, so I don't see any delay here.
Also the 0-greeter*log files (which shows stdout/stderr when the
greeter is running) shows the same warning/error messages that I see on
my machine but I do not have a slowdown. So this log does not seem to
highlight anything unusual that is causing the hang for you.
Since this problem seems to only be affecting a few people, I wonder if
you might be able to identify what about your setup might be different
that could be causing the slowdown. Are you using a particular locale,
IM (Input Method) setup, or somesuch that might be causing GDM to
behave
differently for you than for others? If you are using a locale, then
does GDM not hang if you switch to using the default C locale?
I notice that bug #14857 in defect.opensolaris.org seems like it might
be related. I notice the same unusual GConf error in the "out" file
that you provided. Your problem sounds different than bug #14857 since
it seems to be affecting you on the login screen, while the slowdown in
bug #14857 seems to affect the user session starting. However, if this
is a locale issue, then this might be explained if you are setting your
default locale systemwide (so GDM picks it up) while the person who
reported bug #14857 might be setting their locale via the GDM login
screen.
http://defect.opensolaris.org/bz/show_bug.cgi?id=14857
It might be helpful to get some truss output to debug this problem.
If you rename /usr/bin/gnome-session to /usr/bin/gnome-session-real
and then create a script (with execute permissions) named
/usr/bin/gnome-session which contains these two lines
#/bin/bash
truss -faldo /tmp/truss.out.$$ gnome-session --debug
Then restart the "gdm" service. This will cause GDM to run the
script to start gnome-session, which will launch the gnome-session-real
with truss turned on.
This will do two things. It will make things even slower, and it will
create a /tmp/truss.out.(pid) file which will show what gnome-session
is doing, and might highlight the problem. The truss output timestamps
each line so it might better highlight where the problem is happening.
Truss output is large, so it might make sense to attach the output to
doo bug #14857 rather than as an email attachment.
> I forgot to say: I don't know what "face browser" means. If it is the
> feature that enables the possibility of "choosing" users at the login
> and/or poweroff/restart the the answer to this question is no.
Yes, that is what I meant. If you don't have the Face Browser enabled,
then there is no need to load the users. However, according to the
syslog, this slowdown is only a couple of seconds, and probably not the
issue which is causing the long slowdown that you are seeing.
---
Brian
.
.
_______________________________________________
opensolaris-discuss mailing list
[email protected]