Hi Loris,

to cleanup the database, you should

- stop Robinhood

- run '/usr/sbin/rbh-config empty_db [<db_name> [<db_admin_passwd>]]' to clean the database

- maybe clean your Lustre changelogs with '/usr/bin/lfs changelog_clear <mdtname> <reader_id> 0' (if you use changelogs in RBH, but I did not see 'log_reader' in your started modules in STATS)

- start Robinhood again (all tables will be recreated automatically)


Cheers, Carsten

--
Carsten Beyer
Abteilung Systeme

Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a * D-20146 Hamburg * Germany

Phone:  +49 40 460094-221
Fax:    +49 40 460094-270
Email:  be...@dkrz.de
URL:    http://www.dkrz.de

Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784


On 31.07.2017 10:19, Loris Bennett wrote:
Hi,

Davide Tacchella <dtacchella@cr...> writes:

Hello,

On 07/28/17 09:25, Loris Bennett wrote:
Hi,

The scan on one of my file systems seems to be stuck

    Last filesystem scan:
            status:          running
            start:           2017/07/26 08:29:14 (2d 48min 33s ago)
            last action:     2017/07/26 08:30:37 (2d 47min 10s ago)

The log file repeatedly shows the same file being processed throughout
this period:

    2017/07/28 09:14:14 [5738/1] STATS | ======== General statistics =========
    2017/07/28 09:14:14 [5738/1] STATS | Daemon start time: 2017/07/26 08:29:14
    2017/07/28 09:14:14 [5738/1] STATS | Started modules: 
scan,policy_run(cleanup)
    2017/07/28 09:14:14 [5738/1] STATS | ======== FS scan statistics =========
    2017/07/28 09:14:14 [5738/1] STATS | current scan interval = 1.0d
    2017/07/28 09:14:14 [5738/1] STATS | scan is running:
    2017/07/28 09:14:14 [5738/1] STATS |      started at : 2017/07/26 08:29:14 
(2.0d ago)
    2017/07/28 09:14:14 [5738/1] STATS |      last action: 2017/07/26 08:30:37 
(2.0d ago)
    2017/07/28 09:14:14 [5738/1] STATS |      progress   : 387679 entries 
scanned (0 errors)
    2017/07/28 09:14:14 [5738/1] STATS |      inst. speed (potential):    
490.02 entries/sec (12.24 ms/entry/thread)
    2017/07/28 09:14:14 [5738/1] STATS |      avg. speed  (effective):      
2.21 entries/sec (1.22 ms/entry/thread)
    2017/07/28 09:14:14 [5738/1] STATS | ==== EntryProcessor Pipeline Stats ===
    2017/07/28 09:14:14 [5738/1] STATS | Idle threads: 0
    2017/07/28 09:14:14 [5738/1] STATS | Id constraints count: 10000 (hash 
min=0/max=6/avg=0.6)
    2017/07/28 09:14:14 [5738/1] STATS | Name constraints count: 10000 (hash 
min=0/max=6/avg=0.6)
    2017/07/28 09:14:14 [5738/1] STATS | Stage              | Wait | Curr | 
Done |     Total | ms/op |
    2017/07/28 09:14:14 [5738/1] STATS |  0: GET_FID        |    0 |    0 |    
0 |         0 |  0.00 |
    2017/07/28 09:14:14 [5738/1] STATS |  1: GET_INFO_DB    | 9438 |    9 |    
3 |    396274 |  1.85 |
    2017/07/28 09:14:14 [5738/1] STATS |  2: GET_INFO_FS    |    0 |    0 |    
0 |    396271 |  0.00 |
    2017/07/28 09:14:14 [5738/1] STATS |  3: PRE_APPLY      |    0 |    0 |    
0 |    396271 |  0.00 |
    2017/07/28 09:14:14 [5738/1] STATS |  4: DB_APPLY       |  379 |  171 |    
0 |    395721 |  0.21 | 100.00% batched (avg batch size: 116.2)
    2017/07/28 09:14:14 [5738/1] STATS |  5: RM_OLD_ENTRIES |    0 |    0 |    
0 |         0 |  0.00 |
    2017/07/28 09:14:14 [5738/1] STATS | DB ops: get=293/ins=1/upd=395720/rm=0
    2017/07/28 09:14:14 [5738/1] STATS | --- Pipeline stage details ---
    2017/07/28 09:14:14 [5738/1] STATS | GET_INFO_DB   : first: 
/home/user01/A/B/run02/wrfV2756-0003-00478-000478-01.hdr, status=processing
    2017/07/28 09:14:14 [5738/1] STATS | GET_INFO_DB   : last: 
/home/user01/A/B/run03/swrafV3158-0004-00623-000623-01.img, status=waiting
    2017/07/28 09:14:14 [5738/1] STATS | DB_APPLY      : first: 
63BBC16B/16033637, status=processing
    2017/07/28 09:14:14 [5738/1] STATS | DB_APPLY      : last: 
63BBC16B/397701125, status=waiting

Restarting mysql hasn't solved the problem.  The scan on my other file
system runs fine.

Any ideas what the problem is and how it can be addressed?

Are you sure the problem is RH ?
As the scan on one file system works and the one on the other file
system doesn't, my assumption is that there is some problem with one or
more of the database tables related to the second file system.

Current STATS snapshot shows DB is slowing you down, 9400 of 10k actions
are queued in get_db_info stage

for better understanding of the problem would be nice if you could add:
- at least 2 subsequent STATS showing no progress, you only provided
one
Here are the last two entries from today:

   2017/07/31 09:44:15 [5738/1] STATS | ==================== Dumping stats at 
2017/07/31 09:44:15 =====================
   2017/07/31 09:44:15 [5738/1] STATS | ======== General statistics =========
   2017/07/31 09:44:15 [5738/1] STATS | Daemon start time: 2017/07/26 08:29:14
   2017/07/31 09:44:15 [5738/1] STATS | Started modules: 
scan,policy_run(cleanup)
   2017/07/31 09:44:15 [5738/1] STATS | ======== FS scan statistics =========
   2017/07/31 09:44:15 [5738/1] STATS | current scan interval = 1.0d
   2017/07/31 09:44:15 [5738/1] STATS | scan is running:
   2017/07/31 09:44:15 [5738/1] STATS |      started at : 2017/07/26 08:29:14 
(5.1d ago)
   2017/07/31 09:44:15 [5738/1] STATS |      last action: 2017/07/26 08:30:37 
(5.1d ago)
   2017/07/31 09:44:15 [5738/1] STATS |      progress   : 387679 entries 
scanned (0 errors)
   2017/07/31 09:44:15 [5738/1] STATS |      inst. speed (potential):    490.02 
entries/sec (12.24 ms/entry/thread)
   2017/07/31 09:44:15 [5738/1] STATS |      avg. speed  (effective):      0.89 
entries/sec (1.22 ms/entry/thread)
   2017/07/31 09:44:15 [5738/1] STATS | ==== EntryProcessor Pipeline Stats ===
   2017/07/31 09:44:15 [5738/1] STATS | Idle threads: 0
   2017/07/31 09:44:15 [5738/1] STATS | Id constraints count: 10000 (hash 
min=0/max=6/avg=0.6)
   2017/07/31 09:44:15 [5738/1] STATS | Name constraints count: 10000 (hash 
min=0/max=6/avg=0.6)
   2017/07/31 09:44:15 [5738/1] STATS | Stage              | Wait | Curr | Done 
|     Total | ms/op |
   2017/07/31 09:44:15 [5738/1] STATS |  0: GET_FID        |    0 |    0 |    0 
|         0 |  0.00 |
   2017/07/31 09:44:15 [5738/1] STATS |  1: GET_INFO_DB    | 9438 |    9 |    3 
|    396274 |  1.85 |
   2017/07/31 09:44:15 [5738/1] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 
|    396271 |  0.00 |
   2017/07/31 09:44:15 [5738/1] STATS |  3: PRE_APPLY      |    0 |    0 |    0 
|    396271 |  0.00 |
   2017/07/31 09:44:15 [5738/1] STATS |  4: DB_APPLY       |  379 |  171 |    0 
|    395721 |  0.21 | 100.00% batched (avg batch size: 116.2)
   2017/07/31 09:44:15 [5738/1] STATS |  5: RM_OLD_ENTRIES |    0 |    0 |    0 
|         0 |  0.00 |
   2017/07/31 09:44:15 [5738/1] STATS | DB ops: get=293/ins=1/upd=395720/rm=0
   2017/07/31 09:44:15 [5738/1] STATS | --- Pipeline stage details ---
   2017/07/31 09:44:15 [5738/1] STATS | GET_INFO_DB   : first: 
/home/user01/A/B/run02/wrfV2756-0003-00478-000478-01.hdr, status=processing
   2017/07/31 09:44:15 [5738/1] STATS | GET_INFO_DB   : last: 
/home/user01/A/B/run03/swrafV3158-0004-00623-000623-01.img, status=waiting
   2017/07/31 09:44:15 [5738/1] STATS | DB_APPLY      : first: 
63BBC16B/16033637, status=processing
   2017/07/31 09:44:15 [5738/1] STATS | DB_APPLY      : last: 
63BBC16B/397701125, status=waiting
   2017/07/31 09:44:15 [5738/1] STATS | ======= cleanup policy: trigger stats 
======
   2017/07/31 09:44:15 [5738/1] STATS | Trigger #0 (scheduled)        : running.
   2017/07/31 09:44:15 [5738/1] STATS |     last run: none
   2017/07/31 09:44:15 [5738/1] STATS |     total   : none
   2017/07/31 09:44:15 [5738/1] STATS | ======= cleanup policy: action stats 
======
   2017/07/31 09:44:15 [5738/1] STATS | idle threads       = 5
   2017/07/31 09:44:15 [5738/1] STATS | queued actions     = 0
   2017/07/31 09:44:15 [5738/1] STATS | action status:
   2017/07/31 09:44:15 [5738/1] STATS |     action successful              = 0
   2017/07/31 09:44:15 [5738/1] STATS |     no matching rule               = 
8832
   2017/07/31 09:44:15 [5738/1] STATS | 0 actions successful/8832, 0 (0 blocks)
   2017/07/31 09:44:15 [5738/1] STATS | last action queued    436418 s ago
   2017/07/31 09:44:15 [5738/1] STATS | last action started   436418 s ago
   2017/07/31 09:44:15 [5738/1] STATS | last action completed 436418 s ago
   2017/07/31 09:59:15 [5738/1] STATS | ==================== Dumping stats at 
2017/07/31 09:59:15 =====================
   2017/07/31 09:59:15 [5738/1] STATS | ======== General statistics =========
   2017/07/31 09:59:15 [5738/1] STATS | Daemon start time: 2017/07/26 08:29:14
   2017/07/31 09:59:15 [5738/1] STATS | Started modules: 
scan,policy_run(cleanup)
   2017/07/31 09:59:15 [5738/1] STATS | ======== FS scan statistics =========
   2017/07/31 09:59:15 [5738/1] STATS | current scan interval = 1.0d
   2017/07/31 09:59:15 [5738/1] STATS | scan is running:
   2017/07/31 09:59:15 [5738/1] STATS |      started at : 2017/07/26 08:29:14 
(5.1d ago)
   2017/07/31 09:59:15 [5738/1] STATS |      last action: 2017/07/26 08:30:37 
(5.1d ago)
   2017/07/31 09:59:15 [5738/1] STATS |      progress   : 387679 entries 
scanned (0 errors)
   2017/07/31 09:59:15 [5738/1] STATS |      inst. speed (potential):    490.02 
entries/sec (12.24 ms/entry/thread)
   2017/07/31 09:59:15 [5738/1] STATS |      avg. speed  (effective):      0.89 
entries/sec (1.22 ms/entry/thread)
   2017/07/31 09:59:15 [5738/1] STATS | ==== EntryProcessor Pipeline Stats ===
   2017/07/31 09:59:15 [5738/1] STATS | Idle threads: 0
   2017/07/31 09:59:15 [5738/1] STATS | Id constraints count: 10000 (hash 
min=0/max=6/avg=0.6)
   2017/07/31 09:59:15 [5738/1] STATS | Name constraints count: 10000 (hash 
min=0/max=6/avg=0.6)
   2017/07/31 09:59:15 [5738/1] STATS | Stage              | Wait | Curr | Done 
|     Total | ms/op |
   2017/07/31 09:59:15 [5738/1] STATS |  0: GET_FID        |    0 |    0 |    0 
|         0 |  0.00 |
   2017/07/31 09:59:15 [5738/1] STATS |  1: GET_INFO_DB    | 9438 |    9 |    3 
|    396274 |  1.85 |
   2017/07/31 09:59:15 [5738/1] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 
|    396271 |  0.00 |
   2017/07/31 09:59:15 [5738/1] STATS |  3: PRE_APPLY      |    0 |    0 |    0 
|    396271 |  0.00 |
   2017/07/31 09:59:15 [5738/1] STATS |  4: DB_APPLY       |  379 |  171 |    0 
|    395721 |  0.21 | 100.00% batched (avg batch size: 116.2)
   2017/07/31 09:59:15 [5738/1] STATS |  5: RM_OLD_ENTRIES |    0 |    0 |    0 
|         0 |  0.00 |
   2017/07/31 09:59:15 [5738/1] STATS | DB ops: get=293/ins=1/upd=395720/rm=0
   2017/07/31 09:59:15 [5738/1] STATS | --- Pipeline stage details ---
   2017/07/31 09:59:15 [5738/1] STATS | GET_INFO_DB   : first: 
/home/user01/A/B/run02/wrfV2756-0003-00478-000478-01.hdr, status=processing
   2017/07/31 09:59:15 [5738/1] STATS | GET_INFO_DB   : last: 
/home/user01/A/B/run03/swrafV3158-0004-00623-000623-01.img, status=waiting
   2017/07/31 09:59:15 [5738/1] STATS | DB_APPLY      : first: 
63BBC16B/16033637, status=processing
   2017/07/31 09:59:15 [5738/1] STATS | DB_APPLY      : last: 
63BBC16B/397701125, status=waiting
   2017/07/31 09:59:15 [5738/1] STATS | ======= cleanup policy: trigger stats 
======
   2017/07/31 09:59:15 [5738/1] STATS | Trigger #0 (scheduled)        : running.
   2017/07/31 09:59:15 [5738/1] STATS |     last run: none
   2017/07/31 09:59:15 [5738/1] STATS |     total   : none
   2017/07/31 09:59:15 [5738/1] STATS | ======= cleanup policy: action stats 
======
   2017/07/31 09:59:15 [5738/1] STATS | idle threads       = 5
   2017/07/31 09:59:15 [5738/1] STATS | queued actions     = 0
   2017/07/31 09:59:15 [5738/1] STATS | action status:
   2017/07/31 09:59:15 [5738/1] STATS |     action successful              = 0
   2017/07/31 09:59:15 [5738/1] STATS |     no matching rule               = 
8832
   2017/07/31 09:59:15 [5738/1] STATS | 0 actions successful/8832, 0 (0 blocks)
   2017/07/31 09:59:15 [5738/1] STATS | last action queued    437318 s ago
   2017/07/31 09:59:15 [5738/1] STATS | last action started   437318 s ago
   2017/07/31 09:59:15 [5738/1] STATS | last action completed 437318 s ago

As you can see, the GET_INFO_DB lines still refer to the same files.

- check processes in MySQL (show processlist)
   $ mysqladmin processlist | grep home
   | 555   | Query   | 437722 | Locked         | SELECT id FROM ENTRIES WHERE 
id='63BBC16B:F39B83'                                                    |
   | 556   | Query   | 437722 | Locked         | SELECT id FROM ENTRIES WHERE 
id='63BBC16B:F4A805'                                                    |
   | 557   | Query   | 437722 | Locked         | SELECT id FROM ENTRIES WHERE 
id='63BBC16B:F4A806'                                                    |
   | 558   | Query   | 437722 | Locked         | SELECT id FROM ENTRIES WHERE 
id='63BBC16B:F42F31'                                                    |
   | 559   | Query   | 437722 | Locked         | SELECT id FROM ENTRIES WHERE 
id='63BBC16B:F3CB45'                                                    |
   | 560   | Query   | 437722 | Locked         | SELECT id FROM ENTRIES WHERE 
id='63BBC16B:F42F32'                                                    |
   | 561   | Query   | 437722 | Locked         | SELECT id FROM ENTRIES WHERE 
id='63BBC16B:F4A837'                                                    |
   | 562   | Query   | 437722 | Locked         | SELECT id FROM ENTRIES WHERE 
id='63BBC16B:F39B82'                                                    |
   | 563   | Query   | 437722 | Locked         | SELECT id FROM ENTRIES WHERE 
id='63BBC16B:17B3807D'                                                  |
   | 564   | Query   | 437722 | Locked         | INSERT INTO 
NAMES(id,parent_id,name,path_update,pkn) VALUES 
('63BBC16B:F4A765','63BBC16B:F46FDD','wr |
   | 567   | Query   | 437722 | Locked         | UPDATE ENTRIES SET 
uid='501',gid='staff',size=3983,blocks=8,last_access=1487848194,last_mod=14572268
 |
   | 571   | Query   | 437722 | statistics     | SELECT parent_id, name INTO 
pid, n from NAMES WHERE id= NAME_CONST('pid',_binary'63BBC16B:E563E8' CO |
   | 572   | Sleep   | 404    |                |                                
                                                                      |
   | 21103 | Query   | 194638 | Locked         | SELECT ENTRIES.id AS id FROM 
ENTRIES WHERE ENTRIES.type='file' AND (ENTRIES.last_access<1498701721 O |

On subsequent calls only process 571 seems to change, everything else is locked:

   | 571   | Query   | 437832 | Sorting result | SELECT parent_id, name INTO 
pid, n from NAMES WHERE id= NAME_CONST('pid',_binary'63BBC16B:E563E8' CO |

   | 571   | Query   | 437873 | Sorting result | SELECT parent_id, name INTO 
pid, n from NAMES WHERE id= NAME_CONST('pid',_binary'63BBC16B:E563DE' CO |

   | 571   | Query   | 437888 | statistics     | SELECT parent_id, name INTO 
pid, n from NAMES WHERE id= NAME_CONST('pid',_binary'63BBC16B:E563DE' CO |

   | 571   | Query   | 437915 |                | SELECT CONCAT(  
NAME_CONST('n',_binary'RAxML_log.At5g20170_mafftus_besttree.RUN.0' COLLATE 
'binary') |

- in RH scan configuration do you have any of these ?
      scan_op_timeout        =    60min;
      exit_on_timeout        =    TRUE;
I don't have any timeouts set.

I don't think a change in configuration caused the problem.  I think it
is more to do with a very large change in the number of files at some
point (the file system was filling up and we convinced a number of users
to delete stuff).

I'm not particularly attached to the data, so what would be the best way
to just drop the existing data and rescan?

Cheers,

Loris



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support

Reply via email to