I am able to verify that the cause of the failures was in fact /dev/watchdog 
and /dev/watchdog0 existing on the Lustre file system. The problem was easily 
duplicated by confining the scan to the dev directory: in each case, before 
the individual files were removed, the system would reboot. After removing 
the files the scan ran to completion. 
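For future scans, a quick pre-check for stray device nodes in the data tree can catch this hazard before the scanner trips over it. A minimal sketch (not part of Robinhood; the function name is mine):

```python
import os
import stat

def find_char_devices(root):
    """Walk `root` and return paths of any character-device nodes.

    A copied /dev inside scanned data (e.g. watchdog nodes, as in
    this thread) is exactly the hazard to look for before a scan.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue
            if stat.S_ISCHR(st.st_mode):
                hits.append(path)
    return hits
```

From the shell, `find ROOT -type c` does the same job.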

Thanks again to Justin Miller for the information and explanation. 
====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov



> On Aug 23, 2017, at 10:28 AM, Mervini, Joseph A <jame...@sandia.gov> wrote:
> 
> Justin
> 
> Yes - as a matter of fact I have copied the entire OS to the lustre file 
> system to create a ton of real files. I am testing that hypothesis now.
> 
> Thanks so much for the suggestion! I will post my results.
> ====
> 
> Joe Mervini
> Sandia National Laboratories
> High Performance Computing
> 505.844.6770
> jame...@sandia.gov
> 
> 
> 
>> On Aug 23, 2017, at 10:03 AM, Justin Miller <jmil...@cray.com> wrote:
>> 
>> Does the data you’re scanning include backups of OS root directories? 
>> 
>> We have seen a case where a scan causes a system to reboot with almost no 
>> explanation because the data being scanned included a copy of /dev from an OS 
>> root filesystem backup, and inside that copy of /dev was a character device 
>> file for a watchdog device handler. The Lustre client running the scan also 
>> had the modules loaded to handle the watchdog.
>> 
>> The watchdog timer starts during the scan when path2fid does an open. The 
>> watchdog timer isn’t terminated until a special flag is specified, and 
>> Lustre doesn’t know about the flag, so the watchdog times out and reboots 
>> the system.
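For reference, the Linux watchdog interface's "magic close" is that special flag: writing the single byte 'V' immediately before close tells the driver the close is intentional, so the timer is disarmed rather than left running to expire. A sketch of the protocol (the helper names are mine; do not run this against a real watchdog node unless you actually mean to arm it):

```python
import os

def magic_close(fd):
    # Linux watchdog API: write 'V' right before close so the driver
    # disarms the timer instead of treating the close as a crashed
    # daemon and rebooting when the timer expires.
    os.write(fd, b"V")
    os.close(fd)

def pet_watchdog_once(path="/dev/watchdog"):
    fd = os.open(path, os.O_WRONLY)  # the open alone ARMS the timer
    try:
        os.write(fd, b"\0")          # one keep-alive "pet"
    finally:
        magic_close(fd)              # disarm before closing
```

A plain open followed by a plain close, which is all path2fid does, skips the 'V' write, so the timer keeps running and eventually fires.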
>> 
>> - Justin Miller
>> 
>>> On Aug 23, 2017, at 9:55 AM, Mervini, Joseph A <jame...@sandia.gov> wrote:
>>> 
>>> I just re-ran the scan using strace -f and this is the tail of the output 
>>> up until the machine rebooted:
>>> 
>>> 2017/08/23 08:39:44 [13249/7] ListMgr | SQL query: COMMIT
>>> [pid 13254] <... futex resumed> )       = -1 EAGAIN (Resource temporarily 
>>> unavailable)
>>> [pid 13267] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>>> ...>
>>> [pid 13263] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>>> ...>
>>> [pid 13271] <... ioctl resumed> , 0x7f28351b9490) = 0
>>> [pid 13257] <... write resumed> )       = 58
>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>>> ...>
>>> [pid 13271] newfstatat(10, "ompi-f77.pc",  <unfinished ...>
>>> [pid 13257] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>> [pid 13254] <... futex resumed> )       = -1 EAGAIN (Resource temporarily 
>>> unavailable)
>>> [pid 13270] <... openat resumed> )      = 12
>>> [pid 13257] <... futex resumed> )       = 1
>>> [pid 13255] <... futex resumed> )       = 0
>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>> [pid 13270] ioctl(12, _IOC(_IOC_READ, 0x66, 0xad, 0x08), 0x7f284190a680 
>>> <unfinished ...>
>>> [pid 13257] poll([{fd=15, events=POLLIN|POLLPRI}], 1, 0 <unfinished ...>
>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>> [pid 13270] <... ioctl resumed> )       = 0
>>> [pid 13257] <... poll resumed> )        = 0 (Timeout)
>>> [pid 13255] <... futex resumed> )       = 1
>>> [pid 13254] <... futex resumed> )       = 1
>>> [pid 13253] <... futex resumed> )       = 0
>>> [pid 13268] <... futex resumed> )       = 0
>>> [pid 13270] close(12 <unfinished ...>
>>> [pid 13257] write(15, "\7\0\0\0\3COMMIT", 11 <unfinished ...>
>>> [pid 13255] write(2, "2017/08/23 08:39:44 [13249/6] Li"..., 108 <unfinished 
>>> ...>
>>> 2017/08/23 08:39:44 [13249/6] ListMgr | SQL query: SELECT id FROM ENTRIES 
>>> WHERE id='0x200000405:0xd5b0:0x0'
>>> [pid 13268] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>>> ...>
>>> [pid 13253] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>>> ...>
>>> [pid 13257] <... write resumed> )       = 11
>>> [pid 13255] <... write resumed> )       = 108
>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>>> ...>
>>> [pid 13271] <... newfstatat resumed> {st_mode=S_IFREG|0640, st_size=665, 
>>> ...}, AT_SYMLINK_NOFOLLOW) = 0
>>> [pid 13257] read(15,  <unfinished ...>
>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>>> [pid 13254] <... futex resumed> )       = -1 EAGAIN (Resource temporarily 
>>> unavailable)
>>> [pid 13271] openat(10, "ompi-f77.pc", 
>>> O_RDONLY|O_NONBLOCK|O_NOFOLLOW|O_NOATIME <unfinished ...>
>>> [pid 13266] <... futex resumed> )       = 0
>>> [pid 13255] <... futex resumed> )       = 1
>>> [pid 13254] write(2, "2017/08/23 08:39:44 [13249/5] Li"..., 1082017/08/23 
>>> 08:39:44 [13249/5] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE 
>>> id='0x200000405:0xcc68:0x0'
>>> <unfinished ...>
>>> [pid 13266] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>>> ...>
>>> [pid 13255] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished 
>>> ...>
>>> [pid 13254] <... write resumed> )       = 108
>>> [pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1) = 1
>>> [pid 13270] <... close resumed> )       = 0
>>> [pid 13259] <... futex resumed> )       = 0
>>> Write failed: Broken pipe
>>> 
>>> NOTE: I captured the entire output from strace in a screen log.
>>> 
>>> ====
>>> 
>>> Joe Mervini
>>> Sandia National Laboratories
>>> High Performance Computing
>>> 505.844.6770
>>> jame...@sandia.gov
>>> 
>>> 
>>> 
>>>> On Aug 22, 2017, at 9:17 AM, Mervini, Joseph A <jame...@sandia.gov> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I have not received a response to this posting but have continued to try 
>>>> and figure out why this problem persists. 
>>>> 
>>>> Since I initially opened the request I have been able to duplicate the 
>>>> problem on three different machines. I have also tried multiple kernel 
>>>> versions and Lustre 2.8 client versions, and I have completely rebuilt my 
>>>> Lustre file system with a newer version of 2.8 (2.8.0.9). The problem only 
>>>> presents itself when running a scan against the 2.8 Lustre file system; a 
>>>> 2.5 Lustre file system works fine.
>>>> 
>>>> I decided to run simultaneous scans against both file systems under 
>>>> valgrind. Although (again) the scan of the 2.5 file system completed, the 
>>>> system rebooted before the 2.8 scan could complete. In both scans, however, 
>>>> valgrind’s output was similar, with entries like this:
>>>> 
>>>> ==8883== Thread 7:
>>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>>> ==8883==    at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)
>>>> ==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>>> ==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>>> ==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>>> ==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>>> ==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>>> ==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>>> ==8883==
>>>> ==8883== Conditional jump or move depends on uninitialised value(s)
>>>> ==8883==    at 0x44B9C5: batch_insert_stripe_info (listmgr_stripe.c:168)
>>>> ==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
>>>> ==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
>>>> ==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
>>>> ==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
>>>> ==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
>>>> ==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
>>>> ==8883==
>>>> ==8883==
>>>> ==8883== More than 10000000 total errors detected.  I'm not reporting any 
>>>> more.
>>>> ==8883== Final error counts will be inaccurate.  Go fix your program!
>>>> ==8883== Rerun with --error-limit=no to disable this cutoff.  Note
>>>> ==8883== that errors may occur in your program without prior warning from
>>>> ==8883== Valgrind, because errors are no longer being displayed.
>>>> ==8883== 
>>>> ====
>>>> 
>>>> 
>>>> Is this considered normal?
>>>> 
>>>> 
>>>> Joe Mervini
>>>> Sandia National Laboratories
>>>> High Performance Computing
>>>> 505.844.6770
>>>> jame...@sandia.gov
>>>> 
>>>> 
>>>> 
>>>>> On Jul 11, 2017, at 10:17 AM, Mervini, Joseph A <jame...@sandia.gov> 
>>>>> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I have a problem similar to 
>>>>> https://sourceforge.net/p/robinhood/mailman/message/35883907/ in which 
>>>>> the robinhood server, running mariadb-5.5.52-1.el7.x86_64 and the Lustre 
>>>>> 2.8.0.8 client, reboots when the initial scan is run. I am running this 
>>>>> in a testbed environment prior to deployment on our production system 
>>>>> because I want to get a complete handle on it before I commit to the 
>>>>> deployment. I have two separate Lustre file systems that I am running 
>>>>> against: one is a 408TB Lustre 2.8 file system with ~16M inodes, the 
>>>>> other is a 204TB Lustre 2.5.5 file system with ~3M inodes. 
>>>>> 
>>>>> The curious thing is that I had successfully scanned both file systems 
>>>>> independently on the system with everything working (including the web 
>>>>> GUI), and then basically blew away the databases to get a data point on 
>>>>> how the system performed, and how long it took, if I ran scans on both 
>>>>> file systems simultaneously. It appears that only the 2.8 file system 
>>>>> database is affected: I just ran a fresh scan against the 2.5.5 file 
>>>>> system without problem, then started a new scan against the 2.8 file 
>>>>> system, and once again the machine rebooted.
>>>>> 
>>>>> Like the other support ticket above, when I ran the scan only on the 2.8 
>>>>> file system in debug mode it also reported messages similar to 
>>>>> “2017/07/10 15:44:58 [15191/6] FS_Scan | openat failed on 
>>>>> <parent_fd=18>/libippch.so: Too many levels of symbolic links”. I checked 
>>>>> a large number of the files being reported, and for the most part they 
>>>>> were library files with only a couple of symlinks to the .so file in the 
>>>>> same directory.
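Those messages are consistent with the O_NOFOLLOW flag visible in the strace output above: on Linux, open()/openat() with O_NOFOLLOW fails with ELOOP when the final path component is a symlink, and strerror(ELOOP) reads "Too many levels of symbolic links" even though no loop exists. A small demonstration (the file names are made up for illustration):

```python
import errno
import os
import tempfile

# Create a regular file and a symlink to it in a scratch directory.
d = tempfile.mkdtemp()
real = os.path.join(d, "libreal.so")
open(real, "w").close()
link = os.path.join(d, "liblink.so")
os.symlink("libreal.so", link)

# O_NOFOLLOW refuses to open through the symlink and reports ELOOP,
# the errno behind "Too many levels of symbolic links".
try:
    os.open(link, os.O_RDONLY | os.O_NOFOLLOW)
    got_eloop = False
except OSError as e:
    got_eloop = (e.errno == errno.ELOOP)
```

So an ordinary library symlink is enough to trigger that log line; it does not indicate a symlink cycle on the file system.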
>>>>> 
>>>>> The only other thing of note that I was able to capture is this from the 
>>>>> console output:
>>>>> 
>>>>> [ 3301.937577] LustreError: 
>>>>> 15209:0:(linux-module.c:92:obd_ioctl_getdata()) Version mismatch kernel 
>>>>> (10004) vs application (0) 
>>>>> [ 3301.950059] LustreError: 
>>>>> 15209:0:(class_obd.c:230:class_handle_ioctl()) OBD ioctl: data error"
>>>>> 
>>>>> There was no indication of a fault in any of the log files, and I was 
>>>>> running top and htop during the process; neither CPU nor memory was 
>>>>> exhausted. Nor did I see anything suspicious happening on the file system 
>>>>> itself. 
>>>>> 
>>>>> Any help or clues as to why this is failing would be greatly appreciated. 
>>>>> Thanks in advance.
>>>>> ====
>>>>> 
>>>>> Joe Mervini
>>>>> Sandia National Laboratories
>>>>> High Performance Computing
>>>>> 505.844.6770
>>>>> jame...@sandia.gov
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------------------------
>>>>> Check out the vibrant tech community on one of the world's most
>>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>>> _______________________________________________
>>>>> robinhood-support mailing list
>>>>> robinhood-support@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/robinhood-support
>>>> 
>>> 
>> 
> 
