Hello,

We are running robinhood v3.0 against a Lustre 2.7 filesystem and using the LHSM policy to archive the filesystem.

I am currently testing directory restores with the rbh-undelete command, and I am hitting a segmentation fault when restoring a directory that has been deleted. Notably, the command reliably restores two files from the directory and then segfaults while restoring the third, every time. If I run it again, it restores another two files and segfaults on the third again.

I've installed the debuginfo packages, and below is a stack trace from gdb after a crash. You can see that I am trying to restore a directory that contained 5 files, all in state 'synchro', and that the segfault happens after the first two files are successfully restored.

Here is the rebind_cmd we are using:

lhsm_config {
    # used for "undelete": command to change the fid of an entry in archive
    rebind_cmd = "/usr/sbin/lhsmtool_posix --hsm_root=/mnt/qstar/rds-d1/lhsm --archive {archive_id} --rebind {oldfid} {newfid} {fsroot}";

    # for UUID-based mapping
    uuid {
        xattr = "trusted.lhsm_uuid";
    }
}

[root@rbh-rds-data robinhood-src]# gdb rpms/BUILD/robinhood-3.0/src/robinhood/rbh-undelete
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/robinhood-src/rpms/BUILD/robinhood-3.0/src/robinhood/rbh-undelete...done.
(gdb) run -L /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
Starting program: /root/robinhood-src/rpms/BUILD/robinhood-3.0/src/robinhood/rbh-undelete -L /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Using config file '/etc/robinhood.d/rds_d1.conf'.
2017/01/05 21:36:35 [15513/1] CheckFS | '/rds-d1' matches mount point '/rds-d1', type=lustre, fs=10.143.240.30@tcp1:10.143.240.29@tcp1:/rds-d1
[New Thread 0x7ffff3ae1700 (LWP 15517)]
[Thread 0x7ffff3ae1700 (LWP 15517) exited]
rm_time, id, type, user, group, size, last_mod, lhsm.status, path
2016/12/26 21:07:32, [0x200000ddb:0xe45b:0x0], file, wjt27, wjt27, 13.00 KB, 2016/10/28 08:51:31, synchro, /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_al7230b.c
2016/12/26 21:07:32, [0x200000ddb:0xe45c:0x0], file, wjt27, wjt27, 8.62 KB, 2016/10/28 08:51:31, synchro, /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_rf2959.c
2016/12/26 21:07:32, [0x200000ddb:0xe45d:0x0], file, wjt27, wjt27, 15.34 KB, 2016/10/28 08:51:31, synchro, /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_uw2453.c
2016/12/26 21:07:32, [0x200000ddb:0xe45e:0x0], file, wjt27, wjt27, 50.44 KB, 2016/10/28 08:51:31, synchro, /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_usb.c
2016/12/26 21:07:32, [0x200000ddb:0xe45f:0x0], file, wjt27, wjt27, 7.30 KB, 2016/10/28 08:51:31, synchro, /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_usb.h
[Inferior 1 (process 15513) exited normally]
Missing separate debuginfos, use: debuginfo-install libuuid-2.23.2-33.el7.x86_64 mariadb-libs-5.5.52-1.el7.x86_64 pcre-8.32-15.el7_2.1.x86_64
(gdb) run -R /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
Starting program: /root/robinhood-src/rpms/BUILD/robinhood-3.0/src/robinhood/rbh-undelete -R /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Using config file '/etc/robinhood.d/rds_d1.conf'.
2017/01/05 21:36:48 [15519/1] CheckFS | '/rds-d1' matches mount point '/rds-d1', type=lustre, fs=10.143.240.30@tcp1:10.143.240.29@tcp1:/rds-d1
[New Thread 0x7ffff3ae1700 (LWP 15520)]
[Thread 0x7ffff3ae1700 (LWP 15520) exited]
Restoring '/rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_al7230b.c'... restore OK (file)
Entry successfully updated in the dabatase
Restoring '/rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_rf2959.c'... restore OK (file)
Entry successfully updated in the dabatase

Program received signal SIGSEGV, Segmentation fault.
malloc_consolidate (av=av@entry=0x7ffff61f3760 <main_arena>) at malloc.c:4146
4146        size = p->size & ~(PREV_INUSE|NON_MAIN_ARENA);
Missing separate debuginfos, use: debuginfo-install sssd-client-1.14.0-43.el7_3.4.x86_64
(gdb) bt
#0  malloc_consolidate (av=av@entry=0x7ffff61f3760 <main_arena>) at malloc.c:4146
#1  0x00007ffff5eb6385 in _int_malloc (av=av@entry=0x7ffff61f3760 <main_arena>, bytes=bytes@entry=4096) at malloc.c:3436
#2  0x00007ffff5eb8fbc in __GI___libc_malloc (bytes=4096) at malloc.c:2893
#3  0x00007ffff5e7b60c in __realpath (name=0x7fffffffbd24 "/rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_uw2453.c", resolved=0x0) at canonicalize.c:78
#4  0x00007ffff6b6f98a in llapi_search_fsname (pathname=pathname@entry=0x7fffffffbd24 "/rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_uw2453.c", fsname=fsname@entry=0x7fffffff6b70 "") at liblustreapi.c:1173
#5  0x00007ffff6b6fb0e in llapi_file_open_param (name=name@entry=0x7fffffffbd24 "/rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_uw2453.c", flags=flags@entry=65, mode=436, param=param@entry=0x7fffffff6cd0) at liblustreapi.c:685
#6  0x00007ffff6b6ff75 in llapi_file_open_pool (name=name@entry=0x7fffffffbd24 "/rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_uw2453.c", flags=flags@entry=65, mode=<optimized out>, stripe_size=stripe_size@entry=0, stripe_offset=stripe_offset@entry=-1, stripe_count=stripe_count@entry=0, stripe_pattern=stripe_pattern@entry=-2147483647, pool_name=pool_name@entry=0x0) at liblustreapi.c:849
#7  0x00007ffff6b749f5 in llapi_hsm_import (dst=dst@entry=0x7fffffffbd24 "/rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_uw2453.c", archive=archive@entry=1, st=st@entry=0x7fffffff6e00, stripe_size=stripe_size@entry=0, stripe_offset=stripe_offset@entry=-1, stripe_count=stripe_count@entry=0, stripe_pattern=<optimized out>, stripe_pattern@entry=0, pool_name=pool_name@entry=0x0, newfid=newfid@entry=0x7fffffff6fd0) at liblustreapi_hsm.c:1333
#8  0x00007ffff42f7209 in lhsm_undelete (smi=0x6a0fb0, p_old_id=0x7fffffff9720, p_attrs_old_in=0x7fffffffbc00, p_new_id=0x7fffffff6fd0, p_attrs_new=0x7fffffff6fe0, already_recovered=<optimized out>) at lhsm.c:915
#9  0x000000000040c289 in undelete_helper (id=id@entry=0x7fffffff9720, attrs=attrs@entry=0x7fffffffbc00) at rbh_undelete.c:329
#10 0x000000000040bd37 in undelete () at rbh_undelete.c:440
#11 main (argc=<optimized out>, argv=<optimized out>) at rbh_undelete.c:712

I can then run it again on the same directory and it will again restore another two files before segfaulting again.

(gdb) run -L /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /root/robinhood-src/rpms/BUILD/robinhood-3.0/src/robinhood/rbh-undelete -L /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Using config file '/etc/robinhood.d/rds_d1.conf'.
2017/01/05 21:37:00 [15521/1] CheckFS | '/rds-d1' matches mount point '/rds-d1', type=lustre, fs=10.143.240.30@tcp1:10.143.240.29@tcp1:/rds-d1
[New Thread 0x7ffff3ae1700 (LWP 15522)]
[Thread 0x7ffff3ae1700 (LWP 15522) exited]
rm_time, id, type, user, group, size, last_mod, lhsm.status, path
2016/12/26 21:07:32, [0x200000ddb:0xe45d:0x0], file, wjt27, wjt27, 15.34 KB, 2016/10/28 08:51:31, synchro, /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_uw2453.c
2016/12/26 21:07:32, [0x200000ddb:0xe45e:0x0], file, wjt27, wjt27, 50.44 KB, 2016/10/28 08:51:31, synchro, /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_usb.c
2016/12/26 21:07:32, [0x200000ddb:0xe45f:0x0], file, wjt27, wjt27, 7.30 KB, 2016/10/28 08:51:31, synchro, /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_usb.h
[Inferior 1 (process 15521) exited normally]
(gdb) run -R /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
Starting program: /root/robinhood-src/rpms/BUILD/robinhood-3.0/src/robinhood/rbh-undelete -R /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Using config file '/etc/robinhood.d/rds_d1.conf'.
2017/01/05 21:37:33 [15627/1] CheckFS | '/rds-d1' matches mount point '/rds-d1', type=lustre, fs=10.143.240.30@tcp1:10.143.240.29@tcp1:/rds-d1
[New Thread 0x7ffff3ae1700 (LWP 15628)]
[Thread 0x7ffff3ae1700 (LWP 15628) exited]
Restoring '/rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_rf_uw2453.c'... restore OK (file)
Entry successfully updated in the dabatase
Restoring '/rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_usb.c'... restore OK (file)
Entry successfully updated in the dabatase

Program received signal SIGSEGV, Segmentation fault.
malloc_consolidate (av=av@entry=0x7ffff61f3760 <main_arena>) at malloc.c:4146
4146        size = p->size & ~(PREV_INUSE|NON_MAIN_ARENA);

And then finally running again, it restores one file and then exits cleanly.

(gdb) run -L /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /root/robinhood-src/rpms/BUILD/robinhood-3.0/src/robinhood/rbh-undelete -L /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Using config file '/etc/robinhood.d/rds_d1.conf'.
2017/01/05 21:37:38 [15629/1] CheckFS | '/rds-d1' matches mount point '/rds-d1', type=lustre, fs=10.143.240.30@tcp1:10.143.240.29@tcp1:/rds-d1
[New Thread 0x7ffff3ae1700 (LWP 15630)]
[Thread 0x7ffff3ae1700 (LWP 15630) exited]
rm_time, id, type, user, group, size, last_mod, lhsm.status, path
2016/12/26 21:07:32, [0x200000ddb:0xe45f:0x0], file, wjt27, wjt27, 7.30 KB, 2016/10/28 08:51:31, synchro, /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_usb.h
[Inferior 1 (process 15629) exited normally]
(gdb) run -R /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
Starting program: /root/robinhood-src/rpms/BUILD/robinhood-3.0/src/robinhood/rbh-undelete -R /rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Using config file '/etc/robinhood.d/rds_d1.conf'.
2017/01/05 21:37:48 [15633/1] CheckFS | '/rds-d1' matches mount point '/rds-d1', type=lustre, fs=10.143.240.30@tcp1:10.143.240.29@tcp1:/rds-d1
[New Thread 0x7ffff3ae1700 (LWP 15634)]
[Thread 0x7ffff3ae1700 (LWP 15634) exited]
Restoring '/rds-d1/user/wjt27/HSM-testing/dir1/linux-4.8.5/drivers/net/wireless/zydas/zd1211rw/zd_usb.h'... restore OK (file)
Entry successfully updated in the dabatase
undelete summary:
    1 files
    0 old version
    0 empty files
    0 non-files
    0 no backup
    0 errors
    0 DB errors
[Inferior 1 (process 15633) exited normally]

I was wondering if anyone else using HSM has seen or can reproduce this crash? I'm afraid my C experience is very rusty, but I am trying to understand the code to see if I can spot where it is failing - any pointers here would be most welcome!

Kind regards,
--
Matt Rásó-Barnett
Research Computing Platforms
University Information Services
High Performance Computing Service
University of Cambridge
Email: mjr...@cam.ac.uk

_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support