"This is a story about a man named Jed, a poor mountaineer barely kept
his family fed, and then one day when he was shootin' at some food, when
up from the ground came a'bublin crude... Black Gold. Texas Tea."

Actually, this is a story about what *should* happen when a developer
asks you to redo something you've already done. 

Of course, the above quote is from the Beverly Hillbillies since making
fun of Texans is one of the favorite pastimes of Californians. ;)

I'm certainly not an expert when it comes to testing or debugging in a
UNIX environment, but you don't have to be an expert to help. With all
the recent posts about users looking for a place to start helping and
learning, testing is a great place to get rolling. The following is a
long read with complete (overly verbose) details, so fetch a fresh cup
of coffee and get comfortable. It may not be the "right" or "best"
way to do things, but it's what I did.

Though the snapshot info and steps used to set up the system were
posted to the intel testing thread on tech@ or sent to oga@ directly,
marco@ asked me to make sure I got it right. Here's an excerpt of the
exchange:

marco> jcr are you dead sure you got all the bits and pieces 
       for that intel driver thing?
jcr> marco: After cvs update, I built the kernel, then built 
     xenocara, and finally built the new driver.
jcr> If there were any missing bits after that, then I'm not 
     even aware of them.
marco> well you kind of forgot to make build
marco> and more importantly make includes
marco> would you mind retrying?
marco> i'll give you the exact commands
jcr> sure
marco> first you go to /usr/obj
marco> rm -rf *
marco> cd ../xobj
marco> rm -rf *
marco> that gives you a clean slate
marco> update both /usr/src and /usr/xenocara to -current
marco> then cd /usr/src
marco> make -j4 obj && make -j4 depend && make -j4 includes && 
       make -j4 tags && make -j4 build
marco> btw all this as root
marco> once that completes cd ../xenocara
marco> make bootstrap && make -j8 obj && make -j 4 build
marco> once that completes build a kernel with the GEM_INTELDRM 
       thing enabled 
marco> and make install that
marco> reboot and test
marco> this is more than one hour on my laptop that is fast
marco> easily 4 hours on something slow
jcr> will do. I'll start on it now.

Though I had probably done things right the first time, eliminating the
possibility that one unknowingly got it wrong is sometimes required.

I had installed the then recent April 15 snapshot, then followed oga@'s
instructions, updated src and xenocara, built the kernel with GEM
support, built xenocara, and then finally built the new intel driver. 

Of course, the changes on current.html had been followed to date.
http://www.openbsd.org/faq/current.html

As far as *I* knew, everything was perfect.

Of course, what I supposedly know could always be wrong. It isn't that
I lack the skill to do things correctly and thoroughly; it's just that
mistakes happen to everyone. It's far better to spend the time to
validate a bug by rebuilding the test setup than it is to have one or
more developers waste their time chasing shadows.

I usually build without X running (fewer resources in use and less task
switching).  Since I've seen two unprovoked crashes with the new intel
driver, building from a normal terminal (without X) is how I'm doing all
of the following. Ahhh, the joys of a dedicated test/build box.

Before rebuilding everything to make sure it was done right, back up
the existing files so the error can be recreated as it exists now.
Though it was only the 24th when I started this redo, there have been
plenty of commits since the April 15 snapshot and the April 17 xenocara
cvs update. If one of the changes fixed the issue, being able to recreate
the issue might be the only way to figure out what change made the
difference.  The April 15 snap and GEM-enabled kernel used are already
saved, so I just need to keep a copy of the current /usr/X11R6 directory
which includes the new intel driver I built.

    # cd /usr
    # mkdir X11R6-old
    # cp -R X11R6/* X11R6-old/.
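
The same backup can be wrapped in a small function. This is only a
sketch, assuming a POSIX shell; the name `backup_tree` and the dated
suffix are made up for illustration and are not part of anyone's
instructions. It refuses to overwrite an existing copy, so the tree that
reproduces the bug can't be clobbered by accident.

```shell
# backup_tree SRC: copy directory SRC to SRC-old-YYYYMMDD and print the
# name of the copy. Hypothetical helper; name and suffix are illustrative.
backup_tree() {
    src=${1%/}
    dst="${src}-old-$(date +%Y%m%d)"
    if [ ! -d "$src" ]; then
        echo "backup_tree: $src is not a directory" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "backup_tree: $dst already exists, not overwriting" >&2
        return 1
    fi
    # -R recurses; -p preserves ownership, modes and timestamps
    cp -Rp "$src" "$dst" && echo "$dst"
}
```

Something like `backup_tree /usr/X11R6` (as root) would then leave a
dated copy next to the original.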

Show the relevant configuration:
    # cat /etc/mk.conf
    XENOCARA_RERUN_AUTOCONF=Yes
    SUDO=/usr/bin/sudo
    ACCEPT_JRL_LICENSE=Yes
    CHECK_LIB_DEPENDS=Yes
    # echo $MALLOC_OPTIONS

    # ls /etc/malloc.conf
    ls: /etc/malloc.conf: No such file or directory
    # grep nosuidcoredump /etc/sysctl.conf
    kern.nosuidcoredump=2             # 2=Put suid coredumps in /var/crash
    # grep allowaperture /etc/sysctl.conf
    machdep.allowaperture=2           # see xf86(4)
    # alias mean
    alias mean='sudo nice -n -16'
    #

Clean out object cruft:
    # rm -fr /usr/obj/*
    # rm -fr /usr/xobj/*

Deleting the xenocara tree and restoring from an archive of a fresh
update is the easiest way to avoid the dumbfuckery of gnu autotools.
This is particularly true if you have XENOCARA_RERUN_AUTOCONF set in
your /etc/mk.conf since it results in tons and tons of files being
"modified" which results in cvs taking forever to update the damn mess.

If you're in the mood for pain, do a checkout of xenocara, set
XENOCARA_RERUN_AUTOCONF in your /etc/mk.conf, do a complete build of
xenocara (takes a while), then do `cvs -d$CVSROOT up -ACPd` and wait
forever as countless "modified files" scroll past. If you're on a slow
internet connection like mine, deleting/restoring makes a world of
difference in cvs update times.
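
If the archives are kept with zero-padded dates in their names (for
example xenocara-2010.04.18.tgz), the newest one can be picked
mechanically, since such names sort in date order. A minimal sketch,
assuming that naming scheme; `latest_archive` is a hypothetical helper
name.

```shell
# latest_archive DIR PREFIX: print the newest PREFIX-YYYY.MM.DD.tgz in DIR.
# Relies on the zero-padded date in the name sorting lexically in date order.
# Prints nothing if there is no matching archive.
latest_archive() {
    dir=$1
    prefix=$2
    ls "$dir"/"$prefix"-[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9].tgz \
        2>/dev/null | sort | tail -n 1
}
```

With that, restoring becomes something like
`tar xzf "$(latest_archive /arc/OpenBSD xenocara)"`.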

NOTE: The `-C` flag for `cvs update` is currently broken, so you end up
with merged files in your tree rather than fresh copies from the
repository as the cvs man page states. This is a real mess for those
testing patches since reverting to the -HEAD branch is not possible.

    # rm -fr xenocara/*
    # tar xzvf /arc/OpenBSD/xenocara-2010.04.18.tgz
    # cd xenocara
    # cvs -d$CVSROOT up -APd
    # cd /usr
    # tar czvf /arc/OpenBSD/xenocara-2010.04.24.tgz xenocara
    # cd /usr/src
    # cvs -d$CVSROOT up -APd

With the exception of /usr/src/sys/conf/GENERIC, modified files were
deleted manually since `cvs up -C` is broken.
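
Finding the modified files by eye in a long update is tedious. One way
to list them, sketched here as an assumption about workflow rather than
what was actually done, is to filter the output of a dry-run update:
cvs prefixes locally modified files with `M` and conflicts with `C`.
The function name `modified_files` is made up.

```shell
# modified_files: read `cvs -n -q update` output on stdin and print the
# paths that cvs reports as locally modified (M) or in conflict (C).
modified_files() {
    awk '$1 == "M" || $1 == "C" { sub(/^[MC] /, ""); print }'
}
```

Usage would be along the lines of
`cd /usr/src && cvs -d$CVSROOT -n -q update 2>/dev/null | modified_files`;
the global `-n` option asks cvs not to change any files on disk.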

Show the two modified files:
    # grep makeoptions /usr/src/sys/conf/GENERIC
    makeoptions     DEBUG="-g"      # compile full symbol table
    #makeoptions    PROF="-pg"      # build profiled kernel

    # cat /usr/src/sys/arch/i386/conf/GENERIC_GEM
    # GENERIC with INTELDRM_GEM
    include         "arch/i386/conf/GENERIC"
    option          INTELDRM_GEM
    option          DRMDEBUG

Do the build as per Marco's instructions. Typically I don't use `-j`
since it's a single-processor (SP) system running an SP kernel, where
setting priority is (in theory) a more effective and less error-prone
choice (less task switching should supposedly result in fewer pipeline
stalls and cache misses). Of course, you can use both and most people
think it runs faster. I haven't done timed builds with `make -j` and/or
nice(1) to prove it either way, but I just rolled with it.
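
For anyone who does want numbers, a rough wall-clock comparison is
easy to set up. This is a sketch only: `timed` is a hypothetical
helper, it assumes `date +%s` is available, and one-second resolution
is plenty for builds that run for an hour or more.

```shell
# timed CMD [ARGS...]: run CMD, then report wall-clock seconds on stderr.
# Returns CMD's exit status. Hypothetical helper for comparing build runs.
timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s (exit $status): $*" >&2
    return $status
}
```

Running `timed make build` and `timed make -j4 build`, each from a
freshly cleaned /usr/obj, would give comparable figures.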

    # cd /usr/src
    # make -j4 obj && make -j4 depend && make -j4 includes && make -j4 tags
    # nice -n -16 make -j4 build
    ...
    building shared c library (version 53.1)
    cc -shared -fpic -o libc.so.53.1 'lorder <...snip...>
    building shared object c library
    building profiled c library
    building standard c library
    sed: can't load library 'libc.so.53.1'
    nm: can't load library 'libc.so.53.1'
    sort: can't load library 'libc.so.53.1'
    ...
    *** Error code 4

So something *IS* hosed!

In short, after libc.so.53.1 is built, it can't be used by nm(1),
sed(1), sort(1) and such, so `make build` dies. This is a bad sign, and
there are a whole lot of possible causes.
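
With hours of build output, the first failure is the interesting one;
everything after it is usually fallout. A small sketch for pulling it
out of a saved log, assuming the output was captured with something
like `make build 2>&1 | tee /var/tmp/build.log` (the log path and the
function name `first_failure` are made up for illustration):

```shell
# first_failure LOG: print the first line (with its line number) of a
# saved build log that looks like a loader failure or a make error.
first_failure() {
    grep -n -E "can't load library|\*\*\* Error code" "$1" | head -n 1
}
```

The two patterns are simply the messages seen in the output above;
other failures would need other patterns.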

It could be caused by a hosed local source tree. It could be a problem
with the cvs mirror I use (not entirely sync'd at update time, or other
issues). It could be a dependency line contention caused by using `-j`
for more jobs (e.g. the `-B` flag is disabled when `make -j` is used).
It could be kernel and userland being out of sync.  It could be I missed
something from the list of changes for current which are kept in the
usual place (http://www.openbsd.org/faq/current.html). It could be Marco
didn't get his instructions quite perfect. Or any of a number of other
possible causes.

Now it's time to start eliminating possibilities...

Since Marco's instructions vary from the FAQ, fall back to the known
good method provided by the FAQ, namely the typical build and install
the kernel first, make sure distrib-dirs exist ... and so on. More
importantly, it would be good to figure out if the kernel itself builds,
mainly because I was already running and testing with a kernel I had
built myself with GEM support.

    # rm -fr /usr/obj/*
    # cd /usr/src/sys/arch/i386/conf
    # config GENERIC_GEM
    # cd ../compile/GENERIC_GEM
    # make clean && make depend 
    # nice -n -16 make
    # make install
    # reboot

Interesting and good. At least the kernel build worked.

    # cd /usr/src
    # nice -n -16 make -j4 obj
    # cd /usr/src/etc && env DESTDIR=/ make distrib-dirs 
    # cd /usr/src
    # nice -n -16 make -j4 depend
    # nice -n -16 make -j4 includes 
    # nice -n -16 make -j4 tags
    # nice -n -16 make -j4 build
    ...
    building shared c library (version 53.1)
    cc -shared -fpic -o libc.so.53.1 'lorder <...snip...>
    building shared object c library
    building profiled c library
    building standard c library
    sed: can't load library 'libc.so.53.1'
    nm: can't load library 'libc.so.53.1'
    sort: can't load library 'libc.so.53.1'
    ...
    *** Error code 4

Yep. Same problem. Since my tree could be hosed, or the mirror I used could
be hosed (anoncvs3.usa.openbsd.org run by todd@), the best idea is to
eliminate these two possibilities.

Considering there have been some libc related problems with one of the
us mirrors (anoncvs1.usa.openbsd.org also run by todd@) reported on
misc@ around April 19 [1] and todd@ just fixed a quietly reported bug
with `cvs -d$CVSROOT status` on April 18, it really could be the cvs
server.

[1] http://marc.info/?t=127169228500007&r=1&w=2&n=15

Switch over to using brad's server:
    # export CVSROOT=[email protected]:/cvs

Since there's no telling whether my backup copy of the tree is affected,
delete my local tree, start with 4.6 source from CD, and update.

    # rm -fr /usr/src/*
    # cd /usr/src
    # tar zxf /arc/OpenBSD/src-4.6.tar.gz
    # cvs -d$CVSROOT up -Pd
    # tar czf /arc/OpenBSD/src-2010.04.25.tgz .

Uncomment ``makeoptions DEBUG="-g"'' in /usr/src/sys/conf/GENERIC and
recreate the /usr/src/sys/arch/i386/conf/GENERIC_GEM config file.

Similar to the possibility of a hosed tree, avoid any potential (or
imaginary?) dependency line issue caused by using `make -j` by simply
not using `-j` at all. 

    # cd /usr/src/sys/arch/i386/conf
    # config GENERIC_GEM
    # cd ../compile/GENERIC_GEM
    # mean make clean
    # mean make depend
    # mean make
    # make install
    # reboot

    # rm -rf /usr/obj/*
    # cd /usr/src
    # mean make obj
    # cd /usr/src/etc && env DESTDIR=/ make distrib-dirs
    # cd /usr/src

Marco specifically asked for `make depend`, `make includes`, and `make
tags` to be run before `make build`, so deviate from the FAQ a little.

    # mean make depend && mean make includes && mean make tags
    # mean make build
    # reboot
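
For a sequence this long, it can help to have something that echoes
each step and stops dead at the first failure. A minimal sketch,
assuming a POSIX shell; `run_steps` is a hypothetical helper, and each
line must be self-contained because every step runs in its own
subshell.

```shell
# run_steps: read one command per line on stdin and run them in order,
# stopping at the first one that fails. Each step is echoed first.
#
# Example (not run here):
#   run_steps <<'EOF'
#   cd /usr/src && make obj
#   cd /usr/src/etc && env DESTDIR=/ make distrib-dirs
#   cd /usr/src && make build
#   EOF
run_steps() {
    while IFS= read -r step; do
        [ -n "$step" ] || continue
        echo "==> $step"
        # Subshell: a `cd` in one step does not leak into the next.
        # </dev/null: the step cannot swallow the remaining lines.
        ( eval "$step" ) </dev/null || {
            echo "FAILED: $step" >&2
            return 1
        }
    done
}
```

Chaining with `&&` as above does the same job for a handful of
commands; the wrapper only adds the echo and the explicit failure
message.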

Excellent! The userland build went perfectly. If this had failed, then I
probably would have gone running off into the weeds, screaming. 

After regaining my senses, I would have installed the most recent snap
(Install, not Update), and started over with a new src checkout. And if
this also failed, I'd have started testing for the ever-present
possibility of hardware failure.

At present, it seems we've isolated the earlier build problem to be
either: 
  1.) caused by a hosed tree
  2.) caused by a hosed cvs server
  3.) caused by a hosed `make -j`

After we're done testing what we are actually trying to test, namely if
I had some bits wrong affecting the intel driver, we'll come back later
to further isolate the cause of the previous userland build failure. If
there's a problem either with `make -j` or with the cvs mirror, finding
it is important.

Since the problem *might* be the cvs mirror I was using and my archive
*might* contain a hosed checkout of the xenocara tree, once again the
right answer is to revert to known good source from the release CD, and
update it from a different mirror (previously set).

    # rm -fr /usr/xobj/*
    # rm -fr /usr/xenocara/*
    # cd /usr
    # tar xzf /arc/OpenBSD/xenocara-4.6.tar.gz
    # cd /usr/xenocara
    # cvs -d$CVSROOT up -APd

At this late stage in the development cycle (post 4.7 release), it
probably would have been faster to just do a checkout rather than
updating from 4.6 release sources.

    # cd /usr
    # tar czf /arc/OpenBSD/xenocara-2010.04.25.tgz xenocara

    # cd /usr/xenocara
    # mean make bootstrap
    # mean make obj
    # mean make build

Build the new intel driver:
    # cd /usr/xenocara/driver
    # mv xf86-video-intel xf86-video-intel.cvs
    # tar xzf /arc/OpenBSD/intel-current.tgz
    # cd xf86-video-intel
    # rm -fr /usr/xobj/driver/xf86-video-intel/*
    # make -f Makefile.bsd-wrapper obj
    # make -f Makefile.bsd-wrapper build
    # reboot

Rebooting was not strictly necessary, but gave me a warm fuzzy feeling.
The system is all set up to save any hard crashes, so I'm following
xenocara/README to get the core (running as root).

    # cat /etc/X11/xorg.conf
    Section "ServerFlags"
      Option  "NoTrapSignals" "true"
    EndSection
    # startx -- /usr/X11R6/bin/X -keepPriv

Below is the VT switching sequence which previously gave a repeatable
crash of the X server. The crash previously occurred in two stages:
first the display would go wonky when switching back to graphics mode,
and then on the following switch to text mode, the X server would crash.

    CTL-ALT-F1
    CTL-ALT-F2
    CTL-ALT-F3
    CTL-ALT-F4
    CTL-ALT-F5        (back to X -- display goes wonky, but X running)
    CTL-ALT-F1        (X dies)

With the freshly rebuilt system, the crash was not as consistent as
before. I had to poke at it a bit: flip the resolution, try playing
video files with various video output drivers (`mplayer -vo ?`), and
repeat the above VT switching a few times.

    # xrandr
    Screen 0: minimum 320 x 200, current 1600 x 1200, maximum 2048 x 2048
    VGA connected 1600x1200+0+0 (normal left inverted right x axis y axis)
    388mm x 291mm
       1600x1200      75.0*    75.0  
       1280x1024      76.1     75.0  
       1152x900       76.1  
       1024x768       75.0     70.1     60.0     43.5  
       832x624        74.6  
       800x600        72.2     75.0     60.3     56.2  
       640x480        75.0     72.8     66.7     59.9  
       720x400        87.8     70.1  

Switch graphics mode:
    # xrandr -s 1280x1024 -r 76.1
    # xrandr -s 1152x900 -r 76.1
    # xrandr -s 1024x768 -r 75
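
Flipping resolutions by hand gets old quickly when the crash is
intermittent. A sketch of a loop that walks a list of modes using the
same `xrandr -s ... -r ...` invocation as above; the function name
`cycle_modes` and the `DRY_RUN` and `PAUSE` variables are hypothetical,
and the default pause of five seconds is an arbitrary choice.

```shell
# cycle_modes: read "WIDTHxHEIGHT RATE" pairs on stdin and switch to each
# in turn with xrandr, pausing between switches.
# DRY_RUN=1 only prints the commands; PAUSE sets the delay in seconds.
cycle_modes() {
    while read -r mode rate; do
        [ -n "$mode" ] || continue
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "xrandr -s $mode -r $rate"
        else
            xrandr -s "$mode" -r "$rate" || return 1
            sleep "${PAUSE:-5}"
        fi
    done
}
```

For example,
`printf '1280x1024 76.1\n1152x900 76.1\n1024x768 75\n' | cycle_modes`
repeats the three switches shown above.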

There's a bit more bad news. Previously Xv (the "X Video" extension)
worked with resolutions at or below 1280x1024, but now it doesn't work
at all. The `-vo xv` output driver of mplayer doesn't work and only
results in a nice blue filled window where the video should be playing.

With enough poking, it finally crashed in the same manner as before, but
the output to VT1 was a bit different.

    (==) Log file: "/var/log/Xorg.0.log", Time: Sun Apr 25 11:56:38 2010
    (==) Using config file: "/etc/X11/xorg.conf"
    inteldrm0: gpu hung!
    no reset function for chipset.
    /usr/xenocara/lib/libdrm/intel/intel_bufmgr_gem.c:953: Error setting CPU
    domain in 699: Input/output error
    (EE) intel(0): Failed to submit batch buffer, expect rendering
    corruption or even a frozen display: Input/output error.

    Fatal server error:
    DRM_I915_LEAVEVT failed: Unknown error: -5

    Please consult The X.Org Foundation support
      at http://wiki.x.org

    FatalError re-entered, aborting
    DRM_I915_LEAVEVT failed: Unknown error: -5

    error: [drm:pid22269:inteldrm_lastclose] *ERROR* failed to idle
    hardware: 5
    XIO: fatal IO error 35 (Resource temporarily unavailable) on X server
        ":0.0" after 4210 requests (4207 known processed) with 0 events
        remaining.
    XIO: fatal IO error 35 (Resource temporarily unavailable) on X server
        ":0.0" after 7806 requests (7805 known processed) with 0 events
        remaining.
    gvim: Fatal IO error 35 (Resource temporarily unavailable) on X server
    :0.0.
    xinit: connection to X server lost.
    xterm: fatal IO error 35 (Resource temporarily unavailable) or
    KillClient on X server ":0.0"
    xterm: fatal IO error 35 (Resource temporarily unavailable) or
    KillClient on X server ":0.0"
    xterm: fatal IO error 35 (Resource temporarily unavailable) or
    KillClient on X server ":0.0"

Since not all of the console output makes it into /var/log/Xorg.0.log
it's best to manually write it out to a text file. Now it's time to
collect all the relevant files, tar them up, put them on a server so the
developers can access them, and email out the link to them.

    # cp /var/log/Xorg.0.log /home/jcr/bugs/intel-03/Xorg.0.log-rebuild
    # cp /var/crash/Xorg.core /home/jcr/bugs/intel-03/Xorg.core-rebuild
    # cp /bsd /home/jcr/bugs/intel-03/bsd-rebuild
    # cp /usr/src/sys/arch/i386/compile/GENERIC_GEM/bsd.gdb \
      /home/jcr/bugs/intel-03/bsd.gdb-rebuild
    # cd /home/jcr/bugs/intel-03
    # tar czvf rebuild.tgz Xorg.0.log-rebuild Xorg.core-rebuild \
      bsd-rebuild bsd.gdb-rebuild
    # scp rebuild.tgz m...@mywebserver

Though I did need to deviate from marco@'s exact commands, and improvise
to get around issues, this is to be expected. As long as you're paying
attention, being careful, and keeping good notes on what you do, it's
not an issue.
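
Keeping notes is easier when it costs one short command. A minimal
sketch of a timestamped notes file; the function name `note` and the
`NOTES_FILE` variable (defaulting to ./test-notes.txt) are made up for
illustration.

```shell
# note TEXT...: append a timestamped line to the notes file.
# NOTES_FILE is a hypothetical variable; it defaults to ./test-notes.txt.
note() {
    printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" \
        >> "${NOTES_FILE:-./test-notes.txt}"
}
```

For instance, `note "make build died: can't load library libc.so.53.1"`.
For a complete record of a terminal session, script(1) is also worth a
look.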

Congratulations! According to some people, I just managed to "waste" a
day and a half of my life proving there was nothing wrong with my
original tests, but to me (and hopefully the developers), it was time
very well spent. We now have confirmed the previously reported problems
are repeatable and not self-inflicted, knowingly or otherwise.

In the process of redoing the tests I did find something wrong (either a
hosed `make -j`, a hosed local source tree, or a hosed cvs mirror) and
now I need to go back, figure it out, and report it if it's not a local
problem.

Oddly enough, during one of my reboots above, fortune(6) gave me the
following relevant gem:

    "But the reward of a successful collaboration is a thing that cannot
    be produced by either of the parties working alone. It is akin to
    the benefits of sex with a partner as opposed to masturbation. The
    latter is fun, but you show me anyone who has gotten a baby from
    playing with him or herself, and I'll show you an ugly baby with a
    whole bunch of knuckles."  -- Harlan Ellison

How fitting.

  jcr



-- 
The OpenBSD Journal - http://www.undeadly.org
