I added the following block to my config file to deal with the watchdog problem:
FS_Scan {
    Ignore { path == "**/dev/watchdog" or path == "**/dev/watchdog0" }
}

====
Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov

> On Aug 23, 2017, at 11:50 AM, Mervini, Joseph A <jame...@sandia.gov> wrote:
>
> I am able to verify that the cause of the failures was in fact due to
> /dev/watchdog and /dev/watchdog0 existing on the lustre file system. The
> problem was easily duplicated by confining the scan to the dev directory:
> in each case the system rebooted before the individual files were removed,
> and after removing the files the scan ran to completion.
>
> Thanks again to Justin Miller for the information and explanation.
>
>> On Aug 23, 2017, at 10:28 AM, Mervini, Joseph A <jame...@sandia.gov> wrote:
>>
>> Justin,
>>
>> Yes - as a matter of fact I have copied the entire OS to the lustre file
>> system to create a ton of real files. I am testing that hypothesis now.
>>
>> Thanks so much for the suggestion! I will post my results.
>>
>>> On Aug 23, 2017, at 10:03 AM, Justin Miller <jmil...@cray.com> wrote:
>>>
>>> Does the data you’re scanning include backups of OS root directories?
>>>
>>> We have seen a case where a scan caused a system to reboot with almost no
>>> explanation because the data being scanned included a copy of /dev from an
>>> OS root filesystem backup, and inside that copy of /dev was a character
>>> device file for a watchdog device handler. The Lustre client running the
>>> scan also had the modules loaded to handle the watchdog.
>>>
>>> The watchdog timer starts during the scan when path2fid does an open.
>>> The watchdog timer isn’t terminated until a special flag is specified, and
>>> Lustre doesn’t know about the flag, so the watchdog times out and reboots
>>> the system.
>>>
>>> - Justin Miller
>>>
>>>> On Aug 23, 2017, at 9:55 AM, Mervini, Joseph A <jame...@sandia.gov> wrote:
>>>>
>>>> I just re-ran the scan using strace -f and this is the tail of the output
>>>> up until the machine rebooted:
>>>>
>>>> 2017/08/23 08:39:44 [13249/7] ListMgr | SQL query: COMMIT
>>>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
>>>> [pid 13267] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>>>> [pid 13263] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>>>> [pid 13271] <... ioctl resumed> , 0x7f28351b9490) = 0
>>>> [pid 13257] <... write resumed> ) = 58
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>>>> [pid 13271] newfstatat(10, "ompi-f77.pc", <unfinished ...>
>>>> [pid 13257] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
>>>> [pid 13270] <... openat resumed> ) = 12
>>>> [pid 13257] <... futex resumed> ) = 1
>>>> [pid 13255] <... futex resumed> ) = 0
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13270] ioctl(12, _IOC(_IOC_READ, 0x66, 0xad, 0x08), 0x7f284190a680 <unfinished ...>
>>>> [pid 13257] poll([{fd=15, events=POLLIN|POLLPRI}], 1, 0 <unfinished ...>
>>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13270] <... ioctl resumed> ) = 0
>>>> [pid 13257] <... poll resumed> ) = 0 (Timeout)
>>>> [pid 13255] <... futex resumed> ) = 1
>>>> [pid 13254] <... futex resumed> ) = 1
>>>> [pid 13253] <... futex resumed> ) = 0
>>>> [pid 13268] <... futex resumed> ) = 0
>>>> [pid 13270] close(12 <unfinished ...>
>>>> [pid 13257] write(15, "\7\0\0\0\3COMMIT", 11 <unfinished ...>
>>>> [pid 13255] write(2, "2017/08/23 08:39:44 [13249/6] Li"..., 108 <unfinished ...>
>>>> 2017/08/23 08:39:44 [13249/6] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE id='0x200000405:0xd5b0:0x0'
>>>> [pid 13268] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>>>> [pid 13253] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>>>> [pid 13257] <... write resumed> ) = 11
>>>> [pid 13255] <... write resumed> ) = 108
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>>>> [pid 13271] <... newfstatat resumed> {st_mode=S_IFREG|0640, st_size=665, ...}, AT_SYMLINK_NOFOLLOW) = 0
>>>> [pid 13257] read(15, <unfinished ...>
>>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>>> [pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
>>>> [pid 13271] openat(10, "ompi-f77.pc", O_RDONLY|O_NONBLOCK|O_NOFOLLOW|O_NOATIME <unfinished ...>
>>>> [pid 13266] <... futex resumed> ) = 0
>>>> [pid 13255] <... futex resumed> ) = 1
>>>> [pid 13254] write(2, "2017/08/23 08:39:44 [13249/5] Li"..., 108 <unfinished ...>
>>>> 2017/08/23 08:39:44 [13249/5] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE id='0x200000405:0xcc68:0x0'
>>>> [pid 13266] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
>>>> [pid 13254] <... write resumed> ) = 108
>>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1) = 1
>>>> [pid 13270] <... close resumed> ) = 0
>>>> [pid 13259] <... futex resumed> ) = 0
>>>> Write failed: Broken pipe
>>>>
>>>> NOTE: I captured the entire output from strace in a screen log.
>>>>
>>>>> On Aug 22, 2017, at 9:17 AM, Mervini, Joseph A <jame...@sandia.gov> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I have not received a response to this posting, but I have continued
>>>>> trying to figure out why this problem persists.
>>>>>
>>>>> Since I initially opened the request I have been able to duplicate it on
>>>>> three different machines. I have also tried multiple kernel versions and
>>>>> lustre 2.8 client versions, and I have completely rebuilt my lustre file
>>>>> system with a newer version of 2.8 (2.8.0.9). The problem only presents
>>>>> itself when running a scan against the 2.8 lustre file system; a 2.5
>>>>> lustre file system works fine.
>>>>>
>>>>> I decided to run the simultaneous scan against both file systems under
>>>>> valgrind, and although (again) the scan of the 2.5 file system completed,
>>>>> the system rebooted before the 2.8 scan finished.
>>>>> In both scans, however, valgrind’s output was similar, with entries like this:
>>>>>
>>>>> ==8883== Thread 7:
>>>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>>>> ==8883==    at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)
>>>>> ==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>>>> ==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>>>> ==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>>>> ==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>>>> ==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>>>> ==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>>>> ==8883==
>>>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>>>> ==8883==    at 0x44B9C5: batch_insert_stripe_info (listmgr_stripe.c:168)
>>>>> ==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>>>> ==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>>>> ==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>>>> ==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>>>> ==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>>>> ==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>>>> ==8883==
>>>>> ==8883==
>>>>> ==8883== More than 10000000 total errors detected. I'm not reporting any more.
>>>>> ==8883== Final error counts will be inaccurate. Go fix your program!
>>>>> ==8883== Rerun with --error-limit=no to disable this cutoff. Note
>>>>> ==8883== that errors may occur in your program without prior warning from
>>>>> ==8883== Valgrind, because errors are no longer being displayed.
>>>>> ==8883==
>>>>> ====
>>>>>
>>>>> Is this considered normal?
>>>>>
>>>>>> On Jul 11, 2017, at 10:17 AM, Mervini, Joseph A <jame...@sandia.gov> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a problem similar to
>>>>>> https://sourceforge.net/p/robinhood/mailman/message/35883907/ in which
>>>>>> the robinhood server, running mariadb-5.5.52-1.el7.x86_64 and the lustre
>>>>>> 2.8.0.8 client, reboots when the initial scan is run. I am running
>>>>>> this in a testbed environment prior to deployment on our production
>>>>>> system because I want to get a complete handle on it before I commit to
>>>>>> the deployment. I have 2 separate lustre file systems that I am running
>>>>>> against: one is a 408TB lustre 2.8 file system with ~16M inodes, the
>>>>>> other is a 204TB lustre 2.5.5 file system with ~3M inodes.
>>>>>>
>>>>>> The curious thing is that I had successfully scanned both file systems
>>>>>> independently with everything working (including the web GUI), and then
>>>>>> basically blew away the databases to get a data point on how the system
>>>>>> performed, and how long it took, if I ran a scan on both file systems
>>>>>> simultaneously. It appears that only the 2.8 file system database is
>>>>>> impacted: I just ran a fresh scan against the 2.5.5 file system without
>>>>>> problem, then started a new scan against the 2.8 file system and once
>>>>>> again it rebooted.
>>>>>>
>>>>>> Like the other support ticket above, when I ran the scan only on the 2.8
>>>>>> file system in debug mode it also reported messages similar to
>>>>>> “2017/07/10 15:44:58 [15191/6] FS_Scan | openat failed on
>>>>>> <parent_fd=18>/libippch.so: Too many levels of symbolic links”. I checked
>>>>>> a large number of the files that were being reported, and for the most
>>>>>> part they were library files with only a couple of symlinks to the .so
>>>>>> file in the same directory.
>>>>>>
>>>>>> The only other thing of note that I was able to capture is this from the
>>>>>> console output:
>>>>>>
>>>>>> [ 3301.937577] LustreError: 15209:0:(linux-module.c:92:obd_ioctl_getdata()) Version mismatch kernel (10004) vs application (0)
>>>>>> [ 3301.950059] LustreError: 15209:0:(class_obd.c:230:class_handle_ioctl()) OBD ioctl: data error
>>>>>>
>>>>>> There was no indication of a fault in any of the log files, and I was
>>>>>> running top and htop during the process; neither CPU nor memory was
>>>>>> exhausted. Nor did I see anything suspicious happening on the file
>>>>>> system itself.
>>>>>>
>>>>>> Any help or clues as to why this is failing would be greatly
>>>>>> appreciated. Thanks in advance.
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Check out the vibrant tech community on one of the world's most
>>>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>>>> _______________________________________________
>>>>>> robinhood-support mailing list
>>>>>> robinhood-support@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/robinhood-support