I have a file system that I have tried to scan multiple times, and the scan always
gets stuck at what I believe is the very end of the process. For a long time I
thought the problem was with my Lustre file system and that it needed an fsck,
but an fsck was recently performed and the scans still hang.

I have been looking through the archives for posts about stalled scans and came
across the gstack command, which I had not heard of before. Running it against
my running process produces this:

[root@rbh-server lustre]# gstack 33957
Thread 1 (process 33957):
#0  0x00007f852466a6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000414583 in wait_scan_finished () at fs_scan.c:187
#2  0x0000000000411b05 in FSScan_Wait () at fs_scan_main.c:104
#3  0x000000000040ebeb in main (argc=<optimized out>, argv=<optimized out>) at rbh_daemon.c:1793

I don’t know whether this is telling me that the scan has finished or that the
process is still waiting on the event. My job log shows this:

2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | ==================== Dumping stats at 2018/03/05 14:01:32 =====================
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | ======== General statistics =========
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Daemon start time: 2018/03/01 12:29:08
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Started modules: scan
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | ======== FS scan statistics =========
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | current scan interval = 18.0h
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | scan is running:
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |      started at : 2018/03/01 12:29:08 (4.1d ago)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |      last action: 2018/03/04 05:02:04 (1.4d ago)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |      progress   : 728514630 entries scanned (1157 errors)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |      inst. speed (potential):   3109.06 entries/sec (5.15 ms/entry/thread)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |      avg. speed  (effective):   2074.69 entries/sec (6.64 ms/entry/thread)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | ==== EntryProcessor Pipeline Stats ===
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Idle threads: 32
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Id constraints count: 0 (hash min=0/max=0/avg=0.0)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Name constraints count: 0 (hash min=0/max=0/avg=0.0)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Stage              | Wait | Curr | Done |     Total | ms/op |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |  0: GET_FID        |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |  3: PRE_APPLY      |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |  4: DB_APPLY       |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | DB ops: get=12/ins=728472655/upd=40615/rm=0
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | ==================== Dumping stats at 2018/03/05 14:02:32 =====================
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | ======== General statistics =========
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Daemon start time: 2018/03/01 12:29:08
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Started modules: scan
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | ======== FS scan statistics =========
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | current scan interval = 18.0h
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | scan is running:
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |      started at : 2018/03/01 12:29:08 (4.1d ago)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |      last action: 2018/03/04 05:02:04 (1.4d ago)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |      progress   : 728514630 entries scanned (1157 errors)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |      inst. speed (potential):   3109.06 entries/sec (5.15 ms/entry/thread)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |      avg. speed  (effective):   2074.33 entries/sec (6.64 ms/entry/thread)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | ==== EntryProcessor Pipeline Stats ===
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Idle threads: 32
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Id constraints count: 0 (hash min=0/max=0/avg=0.0)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Name constraints count: 0 (hash min=0/max=0/avg=0.0)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Stage              | Wait | Curr | Done |     Total | ms/op |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |  0: GET_FID        |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |  1: GET_INFO_DB    |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |  2: GET_INFO_FS    |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |  3: PRE_APPLY      |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |  4: DB_APPLY       |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |  5: CHGLOG_CLR     |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS |  6: RM_OLD_ENTRIES |    0 |    0 |    0 |         0 |  0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | DB ops: get=12/ins=728472655/upd=40615/rm=0
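
The only figure that differs between the two dumps is the effective average
speed (2074.69 vs. 2074.33 entries/sec). A minimal sketch of a consistency
check, assuming (this is inferred from the numbers, not taken from the
Robinhood source) that the effective average is simply entries scanned divided
by wall-clock time since the scan started. The function name effective_speed
is hypothetical:

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S"  # timestamp format used in the stats dump


def effective_speed(entries, started, now):
    """Entries per second averaged over wall-clock time since scan start.

    Assumption: this is how the 'avg. speed (effective)' line is derived.
    """
    elapsed = datetime.strptime(now, FMT) - datetime.strptime(started, FMT)
    return entries / elapsed.total_seconds()


# Values copied from the two stats dumps.
ENTRIES = 728514630
STARTED = "2018/03/01 12:29:08"

for dump_time in ("2018/03/05 14:01:32", "2018/03/05 14:02:32"):
    speed = effective_speed(ENTRIES, STARTED, dump_time)
    print(dump_time, f"{speed:.2f} entries/sec")
```

Under that assumption the computed values match the logged ones, which would
mean the average is falling only because time passes while the entry count
stays fixed.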

The number of entries scanned has not changed all day, and I assume it has been
stuck at this value since the last action 1.4 days ago.
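
Rather than comparing dumps by eye, the entry count can be pulled out of each
stats dump and compared. This is a rough sketch that relies only on the
"Dumping stats at" and "progress" lines shown above; the helper names
scan_progress and is_stalled are hypothetical, and the log text would come from
wherever the daemon's log file lives on your system.

```python
import re

# Patterns match the stats-dump lines shown in the log above. \s also matches
# newlines, so text that was wrapped by a mail client still parses.
DUMP_RE = re.compile(r"Dumping stats at (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")
PROGRESS_RE = re.compile(r"progress\s*:\s*(\d+) entries scanned")


def scan_progress(log_text):
    """Return [(dump_time, entries_scanned), ...] in log order."""
    # Splitting on a pattern with one capture group yields
    # [text_before, time1, body1, time2, body2, ...].
    parts = DUMP_RE.split(log_text)
    samples = []
    for when, body in zip(parts[1::2], parts[2::2]):
        match = PROGRESS_RE.search(body)
        if match:
            samples.append((when, int(match.group(1))))
    return samples


def is_stalled(samples):
    """True if at least two dumps exist and all report the same count."""
    return len(samples) >= 2 and len({count for _, count in samples}) == 1


SAMPLE = """\
STATS | ==================== Dumping stats at 2018/03/05 14:01:32 ====
STATS |      progress   : 728514630 entries scanned (1157 errors)
STATS | ==================== Dumping stats at 2018/03/05 14:02:32 ====
STATS |      progress   : 728514630 entries scanned (1157 errors)
"""

samples = scan_progress(SAMPLE)
print(samples)
print("stalled:", is_stalled(samples))
```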

In my previous scans I could not kill the process, or I should say it never died.
My only resolution has been to power cycle the system to get back to a sane
state.
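
On Linux, a process that ignores kill is often found to have one or more
threads in uninterruptible sleep (state "D" in /proc), where signals are not
delivered until the blocking kernel call returns. Whether that is what happens
here is not established; the sketch below only shows one way to check,
assuming a Linux /proc file system. The helper names parse_state and
thread_states are hypothetical.

```python
import os
from collections import Counter


def parse_state(status_text):
    """Extract the one-letter state code from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # Line looks like "State:\tD (disk sleep)".
            return line.split()[1]
    return None


def thread_states(pid):
    """Count the states of all threads of a process (Linux /proc only)."""
    counts = Counter()
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/status") as status_file:
                counts[parse_state(status_file.read())] += 1
        except FileNotFoundError:
            pass  # the thread exited between listdir() and open()
    return counts


# Example with the PID from the gstack output above:
#   print(thread_states(33957))
# Any count under "D" means some thread is blocked inside the kernel.
```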

I am running robinhood 3.1 on a system that’s running RHEL 7.3 against a Lustre
2.8 file system that is ~4.5PB at ~80% capacity. The server is running an Intel
E5-2637 with 256GB of memory.

I’d really like to make full use of robinhood but until I can get a clean scan 
I can’t consider the data to be accurate. Any help would be greatly appreciated.

  
====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov

_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support