Hi Loris,
to cleanup the database, you should
- stop Robinhood
- run '/usr/sbin/rbh-config empty_db [<db_name> [<db_admin_passwd>]]'
to clean the database
- maybe clean your Lustre changelogs with '/usr/bin/lfs changelog_clear
<mdtname> <reader_id> 0' (if you use changelogs in RBH, but I did not
see 'log_reader' in your started modules in STATS)
- start Robinhood again (all tables will be recreated automatically)
Cheers, Carsten
--
Carsten Beyer
Abteilung Systeme
Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a * D-20146 Hamburg * Germany
Phone: +49 40 460094-221
Fax: +49 40 460094-270
Email: be...@dkrz.de
URL: http://www.dkrz.de
Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784
On 31.07.2017 10:19, Loris Bennett wrote:
Hi,
Davide Tacchella <dtacchella@cr...> writes:
Hello,
On 07/28/17 09:25, Loris Bennett wrote:
Hi,
The scan on one of my file systems seems to be stuck
Last filesystem scan:
status: running
start: 2017/07/26 08:29:14 (2d 48min 33s ago)
last action: 2017/07/26 08:30:37 (2d 47min 10s ago)
The log file repeatedly shows the same file being processed throughout
this period:
2017/07/28 09:14:14 [5738/1] STATS | ======== General statistics =========
2017/07/28 09:14:14 [5738/1] STATS | Daemon start time: 2017/07/26 08:29:14
2017/07/28 09:14:14 [5738/1] STATS | Started modules:
scan,policy_run(cleanup)
2017/07/28 09:14:14 [5738/1] STATS | ======== FS scan statistics =========
2017/07/28 09:14:14 [5738/1] STATS | current scan interval = 1.0d
2017/07/28 09:14:14 [5738/1] STATS | scan is running:
2017/07/28 09:14:14 [5738/1] STATS | started at : 2017/07/26 08:29:14
(2.0d ago)
2017/07/28 09:14:14 [5738/1] STATS | last action: 2017/07/26 08:30:37
(2.0d ago)
2017/07/28 09:14:14 [5738/1] STATS | progress : 387679 entries
scanned (0 errors)
2017/07/28 09:14:14 [5738/1] STATS | inst. speed (potential):
490.02 entries/sec (12.24 ms/entry/thread)
2017/07/28 09:14:14 [5738/1] STATS | avg. speed (effective):
2.21 entries/sec (1.22 ms/entry/thread)
2017/07/28 09:14:14 [5738/1] STATS | ==== EntryProcessor Pipeline Stats ===
2017/07/28 09:14:14 [5738/1] STATS | Idle threads: 0
2017/07/28 09:14:14 [5738/1] STATS | Id constraints count: 10000 (hash
min=0/max=6/avg=0.6)
2017/07/28 09:14:14 [5738/1] STATS | Name constraints count: 10000 (hash
min=0/max=6/avg=0.6)
2017/07/28 09:14:14 [5738/1] STATS | Stage | Wait | Curr |
Done | Total | ms/op |
2017/07/28 09:14:14 [5738/1] STATS | 0: GET_FID | 0 | 0 |
0 | 0 | 0.00 |
2017/07/28 09:14:14 [5738/1] STATS | 1: GET_INFO_DB | 9438 | 9 |
3 | 396274 | 1.85 |
2017/07/28 09:14:14 [5738/1] STATS | 2: GET_INFO_FS | 0 | 0 |
0 | 396271 | 0.00 |
2017/07/28 09:14:14 [5738/1] STATS | 3: PRE_APPLY | 0 | 0 |
0 | 396271 | 0.00 |
2017/07/28 09:14:14 [5738/1] STATS | 4: DB_APPLY | 379 | 171 |
0 | 395721 | 0.21 | 100.00% batched (avg batch size: 116.2)
2017/07/28 09:14:14 [5738/1] STATS | 5: RM_OLD_ENTRIES | 0 | 0 |
0 | 0 | 0.00 |
2017/07/28 09:14:14 [5738/1] STATS | DB ops: get=293/ins=1/upd=395720/rm=0
2017/07/28 09:14:14 [5738/1] STATS | --- Pipeline stage details ---
2017/07/28 09:14:14 [5738/1] STATS | GET_INFO_DB : first:
/home/user01/A/B/run02/wrfV2756-0003-00478-000478-01.hdr, status=processing
2017/07/28 09:14:14 [5738/1] STATS | GET_INFO_DB : last:
/home/user01/A/B/run03/swrafV3158-0004-00623-000623-01.img, status=waiting
2017/07/28 09:14:14 [5738/1] STATS | DB_APPLY : first:
63BBC16B/16033637, status=processing
2017/07/28 09:14:14 [5738/1] STATS | DB_APPLY : last:
63BBC16B/397701125, status=waiting
Restarting mysql hasn't solved the problem. The scan on my other file
system runs fine.
Any ideas what the problem is and how it can be addressed?
Are you sure the problem is RH ?
As the scan on one file system works and the one on the other file
system doesn't, my assumption is that there is some problem with one or
more of the database tables related to the second file system.
Current STATS snapshot shows DB is slowing you down, 9400 of 10k actions
are queued in get_db_info stage
for better understanding of the problem would be nice if you could add:
- at least 2 subsequent STATS showing no progress, you only provided
one
Here are the last two entries from today:
2017/07/31 09:44:15 [5738/1] STATS | ==================== Dumping stats at
2017/07/31 09:44:15 =====================
2017/07/31 09:44:15 [5738/1] STATS | ======== General statistics =========
2017/07/31 09:44:15 [5738/1] STATS | Daemon start time: 2017/07/26 08:29:14
2017/07/31 09:44:15 [5738/1] STATS | Started modules:
scan,policy_run(cleanup)
2017/07/31 09:44:15 [5738/1] STATS | ======== FS scan statistics =========
2017/07/31 09:44:15 [5738/1] STATS | current scan interval = 1.0d
2017/07/31 09:44:15 [5738/1] STATS | scan is running:
2017/07/31 09:44:15 [5738/1] STATS | started at : 2017/07/26 08:29:14
(5.1d ago)
2017/07/31 09:44:15 [5738/1] STATS | last action: 2017/07/26 08:30:37
(5.1d ago)
2017/07/31 09:44:15 [5738/1] STATS | progress : 387679 entries
scanned (0 errors)
2017/07/31 09:44:15 [5738/1] STATS | inst. speed (potential): 490.02
entries/sec (12.24 ms/entry/thread)
2017/07/31 09:44:15 [5738/1] STATS | avg. speed (effective): 0.89
entries/sec (1.22 ms/entry/thread)
2017/07/31 09:44:15 [5738/1] STATS | ==== EntryProcessor Pipeline Stats ===
2017/07/31 09:44:15 [5738/1] STATS | Idle threads: 0
2017/07/31 09:44:15 [5738/1] STATS | Id constraints count: 10000 (hash
min=0/max=6/avg=0.6)
2017/07/31 09:44:15 [5738/1] STATS | Name constraints count: 10000 (hash
min=0/max=6/avg=0.6)
2017/07/31 09:44:15 [5738/1] STATS | Stage | Wait | Curr | Done
| Total | ms/op |
2017/07/31 09:44:15 [5738/1] STATS | 0: GET_FID | 0 | 0 | 0
| 0 | 0.00 |
2017/07/31 09:44:15 [5738/1] STATS | 1: GET_INFO_DB | 9438 | 9 | 3
| 396274 | 1.85 |
2017/07/31 09:44:15 [5738/1] STATS | 2: GET_INFO_FS | 0 | 0 | 0
| 396271 | 0.00 |
2017/07/31 09:44:15 [5738/1] STATS | 3: PRE_APPLY | 0 | 0 | 0
| 396271 | 0.00 |
2017/07/31 09:44:15 [5738/1] STATS | 4: DB_APPLY | 379 | 171 | 0
| 395721 | 0.21 | 100.00% batched (avg batch size: 116.2)
2017/07/31 09:44:15 [5738/1] STATS | 5: RM_OLD_ENTRIES | 0 | 0 | 0
| 0 | 0.00 |
2017/07/31 09:44:15 [5738/1] STATS | DB ops: get=293/ins=1/upd=395720/rm=0
2017/07/31 09:44:15 [5738/1] STATS | --- Pipeline stage details ---
2017/07/31 09:44:15 [5738/1] STATS | GET_INFO_DB : first:
/home/user01/A/B/run02/wrfV2756-0003-00478-000478-01.hdr, status=processing
2017/07/31 09:44:15 [5738/1] STATS | GET_INFO_DB : last:
/home/user01/A/B/run03/swrafV3158-0004-00623-000623-01.img, status=waiting
2017/07/31 09:44:15 [5738/1] STATS | DB_APPLY : first:
63BBC16B/16033637, status=processing
2017/07/31 09:44:15 [5738/1] STATS | DB_APPLY : last:
63BBC16B/397701125, status=waiting
2017/07/31 09:44:15 [5738/1] STATS | ======= cleanup policy: trigger stats
======
2017/07/31 09:44:15 [5738/1] STATS | Trigger #0 (scheduled) : running.
2017/07/31 09:44:15 [5738/1] STATS | last run: none
2017/07/31 09:44:15 [5738/1] STATS | total : none
2017/07/31 09:44:15 [5738/1] STATS | ======= cleanup policy: action stats
======
2017/07/31 09:44:15 [5738/1] STATS | idle threads = 5
2017/07/31 09:44:15 [5738/1] STATS | queued actions = 0
2017/07/31 09:44:15 [5738/1] STATS | action status:
2017/07/31 09:44:15 [5738/1] STATS | action successful = 0
2017/07/31 09:44:15 [5738/1] STATS | no matching rule =
8832
2017/07/31 09:44:15 [5738/1] STATS | 0 actions successful/8832, 0 (0 blocks)
2017/07/31 09:44:15 [5738/1] STATS | last action queued 436418 s ago
2017/07/31 09:44:15 [5738/1] STATS | last action started 436418 s ago
2017/07/31 09:44:15 [5738/1] STATS | last action completed 436418 s ago
2017/07/31 09:59:15 [5738/1] STATS | ==================== Dumping stats at
2017/07/31 09:59:15 =====================
2017/07/31 09:59:15 [5738/1] STATS | ======== General statistics =========
2017/07/31 09:59:15 [5738/1] STATS | Daemon start time: 2017/07/26 08:29:14
2017/07/31 09:59:15 [5738/1] STATS | Started modules:
scan,policy_run(cleanup)
2017/07/31 09:59:15 [5738/1] STATS | ======== FS scan statistics =========
2017/07/31 09:59:15 [5738/1] STATS | current scan interval = 1.0d
2017/07/31 09:59:15 [5738/1] STATS | scan is running:
2017/07/31 09:59:15 [5738/1] STATS | started at : 2017/07/26 08:29:14
(5.1d ago)
2017/07/31 09:59:15 [5738/1] STATS | last action: 2017/07/26 08:30:37
(5.1d ago)
2017/07/31 09:59:15 [5738/1] STATS | progress : 387679 entries
scanned (0 errors)
2017/07/31 09:59:15 [5738/1] STATS | inst. speed (potential): 490.02
entries/sec (12.24 ms/entry/thread)
2017/07/31 09:59:15 [5738/1] STATS | avg. speed (effective): 0.89
entries/sec (1.22 ms/entry/thread)
2017/07/31 09:59:15 [5738/1] STATS | ==== EntryProcessor Pipeline Stats ===
2017/07/31 09:59:15 [5738/1] STATS | Idle threads: 0
2017/07/31 09:59:15 [5738/1] STATS | Id constraints count: 10000 (hash
min=0/max=6/avg=0.6)
2017/07/31 09:59:15 [5738/1] STATS | Name constraints count: 10000 (hash
min=0/max=6/avg=0.6)
2017/07/31 09:59:15 [5738/1] STATS | Stage | Wait | Curr | Done
| Total | ms/op |
2017/07/31 09:59:15 [5738/1] STATS | 0: GET_FID | 0 | 0 | 0
| 0 | 0.00 |
2017/07/31 09:59:15 [5738/1] STATS | 1: GET_INFO_DB | 9438 | 9 | 3
| 396274 | 1.85 |
2017/07/31 09:59:15 [5738/1] STATS | 2: GET_INFO_FS | 0 | 0 | 0
| 396271 | 0.00 |
2017/07/31 09:59:15 [5738/1] STATS | 3: PRE_APPLY | 0 | 0 | 0
| 396271 | 0.00 |
2017/07/31 09:59:15 [5738/1] STATS | 4: DB_APPLY | 379 | 171 | 0
| 395721 | 0.21 | 100.00% batched (avg batch size: 116.2)
2017/07/31 09:59:15 [5738/1] STATS | 5: RM_OLD_ENTRIES | 0 | 0 | 0
| 0 | 0.00 |
2017/07/31 09:59:15 [5738/1] STATS | DB ops: get=293/ins=1/upd=395720/rm=0
2017/07/31 09:59:15 [5738/1] STATS | --- Pipeline stage details ---
2017/07/31 09:59:15 [5738/1] STATS | GET_INFO_DB : first:
/home/user01/A/B/run02/wrfV2756-0003-00478-000478-01.hdr, status=processing
2017/07/31 09:59:15 [5738/1] STATS | GET_INFO_DB : last:
/home/user01/A/B/run03/swrafV3158-0004-00623-000623-01.img, status=waiting
2017/07/31 09:59:15 [5738/1] STATS | DB_APPLY : first:
63BBC16B/16033637, status=processing
2017/07/31 09:59:15 [5738/1] STATS | DB_APPLY : last:
63BBC16B/397701125, status=waiting
2017/07/31 09:59:15 [5738/1] STATS | ======= cleanup policy: trigger stats
======
2017/07/31 09:59:15 [5738/1] STATS | Trigger #0 (scheduled) : running.
2017/07/31 09:59:15 [5738/1] STATS | last run: none
2017/07/31 09:59:15 [5738/1] STATS | total : none
2017/07/31 09:59:15 [5738/1] STATS | ======= cleanup policy: action stats
======
2017/07/31 09:59:15 [5738/1] STATS | idle threads = 5
2017/07/31 09:59:15 [5738/1] STATS | queued actions = 0
2017/07/31 09:59:15 [5738/1] STATS | action status:
2017/07/31 09:59:15 [5738/1] STATS | action successful = 0
2017/07/31 09:59:15 [5738/1] STATS | no matching rule =
8832
2017/07/31 09:59:15 [5738/1] STATS | 0 actions successful/8832, 0 (0 blocks)
2017/07/31 09:59:15 [5738/1] STATS | last action queued 437318 s ago
2017/07/31 09:59:15 [5738/1] STATS | last action started 437318 s ago
2017/07/31 09:59:15 [5738/1] STATS | last action completed 437318 s ago
As you can see, the GET_INFO_DB lines still refer to the same files.
- check processes in MySQL (show processlist)
$ mysqladmin processlist | grep home
| 555 | Query | 437722 | Locked | SELECT id FROM ENTRIES WHERE
id='63BBC16B:F39B83' |
| 556 | Query | 437722 | Locked | SELECT id FROM ENTRIES WHERE
id='63BBC16B:F4A805' |
| 557 | Query | 437722 | Locked | SELECT id FROM ENTRIES WHERE
id='63BBC16B:F4A806' |
| 558 | Query | 437722 | Locked | SELECT id FROM ENTRIES WHERE
id='63BBC16B:F42F31' |
| 559 | Query | 437722 | Locked | SELECT id FROM ENTRIES WHERE
id='63BBC16B:F3CB45' |
| 560 | Query | 437722 | Locked | SELECT id FROM ENTRIES WHERE
id='63BBC16B:F42F32' |
| 561 | Query | 437722 | Locked | SELECT id FROM ENTRIES WHERE
id='63BBC16B:F4A837' |
| 562 | Query | 437722 | Locked | SELECT id FROM ENTRIES WHERE
id='63BBC16B:F39B82' |
| 563 | Query | 437722 | Locked | SELECT id FROM ENTRIES WHERE
id='63BBC16B:17B3807D' |
| 564 | Query | 437722 | Locked | INSERT INTO
NAMES(id,parent_id,name,path_update,pkn) VALUES
('63BBC16B:F4A765','63BBC16B:F46FDD','wr |
| 567 | Query | 437722 | Locked | UPDATE ENTRIES SET
uid='501',gid='staff',size=3983,blocks=8,last_access=1487848194,last_mod=14572268
|
| 571 | Query | 437722 | statistics | SELECT parent_id, name INTO
pid, n from NAMES WHERE id= NAME_CONST('pid',_binary'63BBC16B:E563E8' CO |
| 572 | Sleep | 404 | |
|
| 21103 | Query | 194638 | Locked | SELECT ENTRIES.id AS id FROM
ENTRIES WHERE ENTRIES.type='file' AND (ENTRIES.last_access<1498701721 O |
On subsequent calls only process 571 seems to change, everything else is locked:
| 571 | Query | 437832 | Sorting result | SELECT parent_id, name INTO
pid, n from NAMES WHERE id= NAME_CONST('pid',_binary'63BBC16B:E563E8' CO |
| 571 | Query | 437873 | Sorting result | SELECT parent_id, name INTO
pid, n from NAMES WHERE id= NAME_CONST('pid',_binary'63BBC16B:E563DE' CO |
| 571 | Query | 437888 | statistics | SELECT parent_id, name INTO
pid, n from NAMES WHERE id= NAME_CONST('pid',_binary'63BBC16B:E563DE' CO |
| 571 | Query | 437915 | | SELECT CONCAT(
NAME_CONST('n',_binary'RAxML_log.At5g20170_mafftus_besttree.RUN.0' COLLATE
'binary') |
- in RH scan configuration do you have any of these ?
scan_op_timeout = 60min;
exit_on_timeout = TRUE;
I don't have any timeouts set.
I don't think a change in configuration caused the problem. I think it
is more to do with a very large change in the number of files at some
point (the file system was filling up and we convinced a number of users
to delete stuff).
I'm not particularly attached to the data, so what would be the best way
to just drop the existing data and rescan?
Cheers,
Loris
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support