I have a file system that I have tried to scan multiple times, and the scan always gets stuck at what I believe is the very end of the process. For a long time I thought it was a problem with my Lustre file system that needed an fsck, but one was recently performed and the scans still hang.
I have been looking through the archives for any post having to do with stalled scans and came across the gstack command, which I had not heard of before. Running it against my running process produces this:

[root@rbh-server lustre]# gstack 33957
Thread 1 (process 33957):
#0 0x00007f852466a6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000414583 in wait_scan_finished () at fs_scan.c:187
#2 0x0000000000411b05 in FSScan_Wait () at fs_scan_main.c:104
#3 0x000000000040ebeb in main (argc=<optimized out>, argv=<optimized out>) at rbh_daemon.c:1793

I don’t know if this is telling me that the scan is finished or that it is just waiting on the event. My job log shows this:

2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | ==================== Dumping stats at 2018/03/05 14:01:32 =====================
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | ======== General statistics =========
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Daemon start time: 2018/03/01 12:29:08
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Started modules: scan
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | ======== FS scan statistics =========
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | current scan interval = 18.0h
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | scan is running:
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | started at : 2018/03/01 12:29:08 (4.1d ago)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | last action: 2018/03/04 05:02:04 (1.4d ago)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | progress : 728514630 entries scanned (1157 errors)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | inst. speed (potential): 3109.06 entries/sec (5.15 ms/entry/thread)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | avg. speed (effective): 2074.69 entries/sec (6.64 ms/entry/thread)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | ==== EntryProcessor Pipeline Stats ===
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Idle threads: 32
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Id constraints count: 0 (hash min=0/max=0/avg=0.0)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Name constraints count: 0 (hash min=0/max=0/avg=0.0)
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | Stage | Wait | Curr | Done | Total | ms/op |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | 0: GET_FID | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | 1: GET_INFO_DB | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | 2: GET_INFO_FS | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | 3: PRE_APPLY | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | 4: DB_APPLY | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | DB ops: get=12/ins=728472655/upd=40615/rm=0
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | ==================== Dumping stats at 2018/03/05 14:02:32 =====================
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | ======== General statistics =========
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Daemon start time: 2018/03/01 12:29:08
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Started modules: scan
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | ======== FS scan statistics =========
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | current scan interval = 18.0h
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | scan is running:
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | started at : 2018/03/01 12:29:08 (4.1d ago)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | last action: 2018/03/04 05:02:04 (1.4d ago)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | progress : 728514630 entries scanned (1157 errors)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | inst. speed (potential): 3109.06 entries/sec (5.15 ms/entry/thread)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | avg. speed (effective): 2074.33 entries/sec (6.64 ms/entry/thread)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | ==== EntryProcessor Pipeline Stats ===
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Idle threads: 32
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Id constraints count: 0 (hash min=0/max=0/avg=0.0)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Name constraints count: 0 (hash min=0/max=0/avg=0.0)
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | Stage | Wait | Curr | Done | Total | ms/op |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | 0: GET_FID | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | 1: GET_INFO_DB | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | 2: GET_INFO_FS | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | 3: PRE_APPLY | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | 4: DB_APPLY | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | 5: CHGLOG_CLR | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | 6: RM_OLD_ENTRIES | 0 | 0 | 0 | 0 | 0.00 |
2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | DB ops: get=12/ins=728472655/upd=40615/rm=0

The number of entries scanned has not changed all day, and I assume that has been the case since the last action 1.4 days ago.
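For anyone who wants to check for the same symptom without comparing dumps by eye, here is a minimal sketch that pulls the "progress" counter out of successive stats dumps and reports how long it has stayed flat. It assumes only the log line layout shown above (timestamp first, then a `STATS | progress : N entries scanned` field); the function names `progress_samples` and `stalled_for` are made up for this example and are not part of robinhood.

```python
import re
from datetime import datetime

# Matches a STATS "progress" line in the layout shown in the log above.
PROGRESS_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*STATS \|\s*"
    r"progress\s*:\s*(\d+) entries scanned"
)


def progress_samples(lines):
    """Return a list of (timestamp, entries_scanned) tuples, in log order."""
    samples = []
    for line in lines:
        match = PROGRESS_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
            samples.append((stamp, int(match.group(2))))
    return samples


def stalled_for(samples):
    """Return the time span covered by the trailing run of identical counts.

    Returns None when there are no samples. A zero timedelta means the most
    recent dump differs from the one before it (or there is only one dump).
    """
    if not samples:
        return None
    last_stamp, last_count = samples[-1]
    first_stamp = last_stamp
    for stamp, count in reversed(samples):
        if count != last_count:
            break
        first_stamp = stamp
    return last_stamp - first_stamp


# Two lines copied from the log above, one per stats dump.
SAMPLE = [
    "2018/03/05 14:01:32 robinhood@rbh-server[33957/3] STATS | "
    "progress : 728514630 entries scanned (1157 errors)",
    "2018/03/05 14:02:32 robinhood@rbh-server[33957/3] STATS | "
    "progress : 728514630 entries scanned (1157 errors)",
]

if __name__ == "__main__":
    flat = stalled_for(progress_samples(SAMPLE))
    print(f"progress counter unchanged for {flat}")
```

Run against the whole log file (for example `progress_samples(open(path))`, with `path` pointing at wherever the daemon log lives), this gives the length of the flat period as seen in the dumps. It only tells you the counter stopped moving; it says nothing about why.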
In my previous scans I could not kill the process, or I should say it never died. My only resolution has been to power cycle the system to get back to a sane state.

I am running robinhood 3.1 on a system running RHEL 7.3, against a Lustre 2.8 file system that is ~4.5PB at ~80% capacity. The server is running an Intel E5-2637 with 256GB of memory.

I’d really like to make full use of robinhood, but until I can get a clean scan I can’t consider the data to be accurate. Any help would be greatly appreciated.

====
Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov

_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support