On 09/01/2010 12:41 AM, Jeff Johnson wrote:
Yesterday we could reproduce the rpm locking problem.
The openpkg rc facility is started by the cron.
The rc facility does a "rpm -q openpkg" query.
This query is locked.
We have seen 100 of them in the process table (limit of the cron).
In this situation all rpm queries hang as well as a db_verify.
At the end I have attached the output of db_stat -CA
root 12777 12589 0 Sep 02 ? 1:03 /usr/sbin/cron
root 4501 12777 0 17:10:00 ? 0:00 sh -c
/opt/local/opkg/bin/openpkg rpm -q openpkg > /dev/null 2>&1 && touch /v
root 4509 4501 0 17:10:00 ? 0:02
/opt/local/opkg/lib/openpkg/rpm -q openpkg
In the process table we had 100 processes (cron limit) of
/opt/local/opkg/lib/openpkg/rpm -q openpkg
after the termination of all rpm queries we saw the rpm-db in the state
"run database recovery"
xxxxx:/opt/local/opkg/RPM/DB# openpkg rpm -qavv > /var/tmp/rpm-qavv.out
D: pool fd: created size 464 limit -1 flags 0
D: pool iob: created size 20 limit -1 flags 0
D: pool mire: created size 84 limit -1 flags 0
D: pool lua: created size 32 limit -1 flags 0
D: pool ts: created size 888 limit -1 flags 0
D: pool db: created size 208 limit -1 flags 0
rpm: openpkg: layout: [global] tmpdir = "/opt/local/opkg/RPM/TMP"
D: opening db environment /opt/local/opkg/RPM/DB/Packages auto_commit:cdb
rpmdb: Thread/process 2479/0 failed: Thread died in Berkeley DB library
error: db4 error(-30974) from dbenv->failchk: DB_RUNRECOVERY: Fatal
error, run database recovery
error: Runnning db->verify ...
D: opening db environment /opt/local/opkg/RPM/DB/Packages auto_commit:cdb
D: opening db index /opt/local/opkg/RPM/DB/Packages rdonly mode=0x0
D: locked db index /opt/local/opkg/RPM/DB/Packages
D: opening db index /opt/local/opkg/RPM/DB/Name rdonly mode=0x0
A "rpm --db-cleanup" has fixed the rpm-db.
xxxxx:/opt/local/opkg/RPM/DB# /opt/local/opkg/bin/openpkg rpm --db-cleanup
rpmdb: CLEANING UP EXISTING RPM DATABASE (/opt/local/opkg/RPM/DB)
rpmdb: cleaning up RPM database DB files
rpmdb: making sure RPM database are of the correct type
rpmdb: rebuilding RPM database (built-in RPM procedure)
rpmdb: making sure RPM database files have consistent attributes
rpmdb: performing read/write operation on RPM database
now the output of "db_stat -CA"
Default locking region information:
1503 Last allocated locker ID
0x7fffffff Current maximum unused locker ID
5 Number of lock modes
1000 Maximum number of locks possible
1000 Maximum number of lockers possible
1000 Maximum number of lock objects possible
50 Number of lock object partitions
296 Number of current locks
306 Maximum number of locks at any one time
196 Maximum number of locks in any one bucket
263 Maximum number of locks stolen by for an empty partition
176 Maximum number of locks stolen for any one partition
492 Number of current lockers
492 Maximum number of lockers at any one time
2 Number of current lock objects
5 Maximum number of lock objects at any one time
1 Maximum number of lock objects in any one bucket
0 Maximum number of objects stolen by for an empty partition
0 Maximum number of objects stolen for any one partition
4517 Total number of locks requested
3958 Total number of locks released
0 Total number of locks upgraded
827 Total number of locks downgraded
35 Lock requests not available due to conflicts, for which we waited
0 Lock requests not available due to conflicts, for which we did not wait
0 Number of deadlocks
0 Lock timeout value
0 Number of locks that have timed out
0 Transaction timeout value
0 Number of transactions that have timed out
504KB The size of the lock region
10 The number of partition locks that required waiting (0%)
4 The maximum number of times any partition lock was waited for (0%)
0 The number of object queue operations that required waiting (0%)
17 The number of locker allocations that required waiting (0%)
0 The number of region locks that required waiting (0%)
1 Maximum hash bucket length
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock REGINFO information:
Lock Region type
4 Region ID
__db.004 Region name
0xfed80000 Original region address
0xfed80000 Region address
0xfed800c8 Region primary address
0 Region maximum allocation
0 Region allocated
Region allocations: 3006 allocations, 0 failures, 0 frees, 1 longest
Allocations by power-of-two sizes:
1KB 3002
2KB 0
4KB 1
8KB 0
16KB 2
32KB 0
64KB 1
128KB 0
256KB 0
512KB 0
1024KB 0
REGION_JOIN_OK Region flags
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock region parameters:
283 Lock region region mutex [0/581 0% 19054/0]
1031 locker table size
1031 object table size
560 obj_off
64720 locker_off
1 need_dd
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock conflict matrix:
0 0 0 0 0
0 0 1 0 0
0 1 1 1 1
0 0 0 0 0
0 0 1 0 1
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Locks grouped by lockers:
Locker Mode Count Status ----------------- Object ---------------
407 dd= 0 locks held 0 write locks 0 pid/thread 7707/0
408 dd= 0 locks held 1 write locks 0 pid/thread 7713/0
408 READ 1 HELD (2808e 4010004 a200b853 67dd 0)
handle 0
409 dd= 0 locks held 0 write locks 0 pid/thread 7713/0
40a dd= 0 locks held 1 write locks 0 pid/thread 7713/0
40a READ 1 HELD (2808f 4010004 a14ac504 1ee7d 0)
handle 0
40b dd= 0 locks held 1 write locks 0 pid/thread 7705/0
40b READ 1 HELD (2808e 4010004 a200b853 67dd 0)
handle 0
40c dd= 0 locks held 0 write locks 0 pid/thread 7705/0
40d dd= 0 locks held 1 write locks 0 pid/thread 7705/0
40d READ 1 HELD (2808f 4010004 a14ac504 1ee7d 0)
handle 0
40e dd= 0 locks held 1 write locks 0 pid/thread 7707/0
40e READ 1 HELD (2808e 4010004 a200b853 67dd 0)
handle 0
40f dd= 0 locks held 0 write locks 0 pid/thread 7707/0
410 dd= 0 locks held 1 write locks 0 pid/thread 7707/0
410 READ 1 HELD (2808f 4010004 a14ac504 1ee7d 0)
handle 0
.....
3ff dd= 0 locks held 1 write locks 0 pid/thread 4509/0
3ff READ 1 HELD (2808e 4010004 a200b853 67dd 0)
handle 0
400 dd= 0 locks held 0 write locks 0 pid/thread 4509/0
401 dd= 0 locks held 1 write locks 0 pid/thread 4509/0
401 READ 1 HELD (2808f 4010004 a14ac504 1ee7d 0) handle
.....
Any hint out of this output??
Best regards
Roman
On Aug 31, 2010, at 4:23 PM, Roman Gaechter wrote:
The problem is intermittend and we are not able tho reproduce it as
we like.
Intermittent is what makes it tricky to identify.
At the moment the db_verify returns no error.
So technically no "corruption" if db_verify doesn't detect any.
rpm -qavv will show the data integrity checks: each installed header
has a SHA1
digest attached, and you should see confirmation that the SHA1 digest
check passes when -vv is passed.
Maybe the output of db_stat -CA will give a clue?
If a "hang" is the issue, running db_stat will show what locks are held.
The output you have attached is from a quiescent system afaict.
Try running (as root so that locks can be acquired) rpm -qavv as before.
In the middle, hit ^Z.
then run db_stat -Cl.
You should see a shared/read lock from the running rpm -qa in the
db_stat output.
The locks usually have a state attached like READ or WRITE or HELD.
And one cause of "hangs" is a deadlock or a "stale lock" (though
rpm should detect and clean up "stale locks" automatically these days).
One way to see "stale lock" cleanup is to start an rpm install
operation, and then send kill -9
to rpm to cause immediate program termination.
The next rpm invocation (as root) should have msgs displaying the
"stale lock"
cleanup (from the previously terminated by kill -9 rpm install).
That's basically the techniques for diagnosing "hangs" with RPM and
Berkeley DB.
hth
73 de Jeff
db_tool db_stat -CA
Default locking region information:
629 Last allocated locker ID
0x7fffffff Current maximum unused locker ID
5 Number of lock modes
1000 Maximum number of locks possible
1000 Maximum number of lockers possible
1000 Maximum number of lock objects possible
160 Number of lock object partitions
0 Number of current locks
48 Maximum number of locks at any one time
4 Maximum number of locks in any one bucket
0 Maximum number of locks stolen by for an empty partition
0 Maximum number of locks stolen for any one partition
0 Number of current lockers
12 Maximum number of lockers at any one time
0 Number of current lock objects
34 Maximum number of lock objects at any one time
1 Maximum number of lock objects in any one bucket
0 Maximum number of objects stolen by for an empty partition
0 Maximum number of objects stolen for any one partition
12898 Total number of locks requested
12898 Total number of locks released
90 Total number of locks upgraded
495 Total number of locks downgraded
4 Lock requests not available due to conflicts, for which we waited
0 Lock requests not available due to conflicts, for which we
did not wait
0 Number of deadlocks
0 Lock timeout value
0 Number of locks that have timed out
0 Transaction timeout value
0 Number of transactions that have timed out
512KB The size of the lock region
11 The number of partition locks that required waiting (0%)
7 The maximum number of times any partition lock was waited for
(0%)
0 The number of object queue operations that required waiting (0%)
3 The number of locker allocations that required waiting (0%)
0 The number of region locks that required waiting (0%)
1 Maximum hash bucket length
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock REGINFO information:
Lock Region type
4 Region ID
__db.004 Region name
0xfed00000 Original region address
0xfed00000 Region address
0xfed000c8 Region primary address
0 Region maximum allocation
0 Region allocated
Region allocations: 3006 allocations, 0 failures, 0 frees, 1 longest
Allocations by power-of-two sizes:
1KB 3002
2KB 0
4KB 0
8KB 1
16KB 2
32KB 0
64KB 1
128KB 0
256KB 0
512KB 0
1024KB 0
REGION_JOIN_OK Region flags
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock region parameters:
283 Lock region region mutex [0/223 0% 23806/0]
1031 locker table size
1031 object table size
560 obj_off
69560 locker_off
1 need_dd
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock conflict matrix:
0 0 0 0 0
0 0 1 0 0
0 1 1 1 1
0 0 0 0 0
0 0 1 0 1
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Locks grouped by lockers:
Locker Mode Count Status ----------------- Object ---------------
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Locks grouped by object:
Locker Mode Count Status ----------------- Object ---------------
Regards
Roman
On 08/31/2010 02:20 PM, Jeff Johnson wrote:
On Aug 31, 2010, at 4:23 AM, Roman Gächter wrote:
Hello
We run into serious problems with OpenPKG 4.0.2 and 4.0.7 on
solaris 10 SPARC in "solaris whole rooted" zones.
After some days of normal operation our rpm db gets corrupted.
(aside)
"Corrupted" means data loss or integrity failure technically
(although its quite common to hear "corruption" applied
to Berkeley DB with other meanings.)
Is there data loss or integrity failure? Running db_verify
cd /var/lib/rpm
db_verify Packages
against Packages or other "tables" (Berkeley DB calls
"tables" databases) will detect integrity failures.
Data loss can be detected by, say, counting the number
of packages installed like
rpm -qa | wc -l
We see hanging processes from the the openpkg rc facility.
The openpkg rc script tests the rpm db with a "openpkg rpm -q
openpkg" query.
These queries hang up.
Hang up how? Can you add -vv to a query? There is also
cd /var/lib/rpm
db_stat -CA # <-- -CA displays all info, -Cl displays lock info
Note that you MUST use db_verify/db_stat for the same version of
Berkeley DB used by RPM. The tools for the same version of Berkeley
DB if internal
to RPM are released with RPM.
When this happens the permission of the rpm lockfiles
RPM/DB/__db.001 .002 .003 changes to root.
An rpmdb is protected by permissions on /var/lib/rpm.
The associated __db* files are created as needed.
I would expect "root" as owner typically, because /var/lib/rpm is
usually writable only by "root", and rpm installs are typically run
as "root".
But the openpkg wrapper appears to have a different ownership model.
To recover from this problem we have to terminate the rpm queries,
cleanup and rebuild the rpm db and to change back
the permissions of the rpm lockfiles to the openpkg user.
The building of our openpkg4 binary package could be done without
problems with the original openpkg-4.0.7 source package.
We are using the VALUE license.
Does anyone had similar problems?
Any hints to resolve?
Identifying what is causing the "hang" is what is needed first.
hth
73 de Jeff
Best regards
Roman
--
--
Mit freundlichen Grüssen
==============================
Roman Gächter
Principal Consultant
roman.gaech...@trivadis.com
Trivadis AG
Infrastructure Managed Services
Europastrasse 5
CH-8152 Glattbrugg
Tel.: +41 44 808 70 20
Fax : +41 44 808 70 21
Mobile: +41 79 412 97 11
http://www.trivadis.com
==============================