Re: [robinhood-support] Robinhood 3.0.1 rebooting unexpectedly on initial scan with Lustre 2.8.0.8

LEIBOVICI Thomas Wed, 23 Aug 2017 08:03:59 -0700

If you get kernel crashes, a system dump, the system console or the system logs would give more information about the root cause, rather than tracing a userland process.
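
As a rough illustration of the suggestion above, here is a minimal Python sketch that filters a saved console or syslog capture for lines that commonly accompany kernel-level failures. The marker list and the idea of passing a log path on the command line are assumptions made for this example; it does not replace a proper crash dump.

```python
import re
import sys

# Assumed marker strings; extend the list to suit the system being examined.
CRASH_MARKERS = re.compile(r"Kernel panic|BUG:|Oops|LustreError|LBUG")


def find_crash_lines(lines):
    """Return the lines that contain one of the assumed crash markers."""
    return [line.rstrip("\n") for line in lines if CRASH_MARKERS.search(line)]


if __name__ == "__main__":
    # Hypothetical usage: python scan_log.py <saved-console-or-syslog-file>
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log:
            for hit in find_crash_lines(log):
                print(hit)
```

The filter only narrows down where to look; the lines just before a reboot in the console capture are usually the most informative.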


On 08/23/17 16:55, Mervini, Joseph A wrote:

I just re-ran the scan using strace -f and this is the tail of the output up until the machine rebooted:
2017/08/23 08:39:44 [13249/7] ListMgr | SQL query: COMMIT
[pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13267] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13263] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13271] <... ioctl resumed> , 0x7f28351b9490) = 0
[pid 13257] <... write resumed> )       = 58
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13271] newfstatat(10, "ompi-f77.pc",  <unfinished ...>
[pid 13257] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13270] <... openat resumed> )      = 12
[pid 13257] <... futex resumed> )       = 1
[pid 13255] <... futex resumed> )       = 0
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13270] ioctl(12, _IOC(_IOC_READ, 0x66, 0xad, 0x08), 0x7f284190a680 <unfinished ...>
[pid 13257] poll([{fd=15, events=POLLIN|POLLPRI}], 1, 0 <unfinished ...>
[pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13270] <... ioctl resumed> )       = 0
[pid 13257] <... poll resumed> )        = 0 (Timeout)
[pid 13255] <... futex resumed> )       = 1
[pid 13254] <... futex resumed> )       = 1
[pid 13253] <... futex resumed> )       = 0
[pid 13268] <... futex resumed> )       = 0
[pid 13270] close(12 <unfinished ...>
[pid 13257] write(15, "\7\0\0\0\3COMMIT", 11 <unfinished ...>
[pid 13255] write(2, "2017/08/23 08:39:44 [13249/6] Li"..., 108 <unfinished ...>
2017/08/23 08:39:44 [13249/6] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE id='0x200000405:0xd5b0:0x0'
[pid 13268] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13253] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13257] <... write resumed> )       = 11
[pid 13255] <... write resumed> )       = 108
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13271] <... newfstatat resumed> {st_mode=S_IFREG|0640, st_size=665, ...}, AT_SYMLINK_NOFOLLOW) = 0
[pid 13257] read(15,  <unfinished ...>
[pid 13255] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13254] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13271] openat(10, "ompi-f77.pc", O_RDONLY|O_NONBLOCK|O_NOFOLLOW|O_NOATIME <unfinished ...>
[pid 13266] <... futex resumed> )       = 0
[pid 13255] <... futex resumed> )       = 1
[pid 13254] write(2, "2017/08/23 08:39:44 [13249/5] Li"..., 108
2017/08/23 08:39:44 [13249/5] ListMgr | SQL query: SELECT id FROM ENTRIES WHERE id='0x200000405:0xcc68:0x0'
 <unfinished ...>
[pid 13266] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13255] futex(0x7f287e59e9f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 13254] <... write resumed> )       = 108
[pid 13254] futex(0x7f287e59e9f0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 13270] <... close resumed> )       = 0
[pid 13259] <... futex resumed> )       = 0
Write failed: Broken pipe

NOTE: I captured the entire output from strace in a screen log.

====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov
On Aug 22, 2017, at 9:17 AM, Mervini, Joseph A <jame...@sandia.gov> wrote:
Hello,
I have not received a response to this posting but have continued to try to figure out why this problem persists.
Since I initially opened the request I have been able to duplicate it on three different machines. I have also tried multiple kernel versions and lustre 2.8 client versions. I have also completely rebuilt my lustre file system with a newer version of 2.8 (2.8.0.9). The problem only presents itself when running a scan against the 2.8 lustre file system. A 2.5 lustre file system works fine.
I decided to run simultaneous scans against both file systems under valgrind. Although (again) the scan of the 2.5 file system completed, the system rebooted before the scan of the 2.8 file system completed. However, in both scans valgrind's output was similar, with output like this:
==8883== Thread 7:
==8883== Conditional jump or move depends on uninitialised value(s)
==8883==    at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)
==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
==8883==
==8883== Conditional jump or move depends on uninitialised value(s)
==8883==    at 0x44B9C5: batch_insert_stripe_info (listmgr_stripe.c:168)
==8883==    by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
==8883==    by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
==8883==    by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
==8883==    by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
==8883==    by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==8883==    by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
==8883==
==8883==
==8883== More than 10000000 total errors detected. I'm not reporting any more.
==8883== Final error counts will be inaccurate.  Go fix your program!
==8883== Rerun with --error-limit=no to disable this cutoff.  Note
==8883== that errors may occur in your program without prior warning from
==8883== Valgrind, because errors are no longer being displayed.
==8883==
====


Is this considered normal?
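
To see whether reports like these cluster at one call site, the top frame of each report can be tallied from a saved valgrind log. The following is a minimal sketch, assuming the log uses the `==PID==    at 0xADDR: function (file:line)` layout shown above; the function name `count_top_frames` is made up for this example.

```python
import re
from collections import Counter

# Matches the top frame of a valgrind report, for example:
# "==8883==    at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)"
TOP_FRAME = re.compile(r"^==\d+==\s+at 0x[0-9A-Fa-f]+: (\S+) \(([^)]*)\)")


def count_top_frames(lines):
    """Count valgrind reports by (function, location) of their top frame."""
    counts = Counter()
    for line in lines:
        match = TOP_FRAME.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Because valgrind stops reporting once it hits its error limit, the tally only reflects the reports that were actually written to the log.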


Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov
On Jul 11, 2017, at 10:17 AM, Mervini, Joseph A <jame...@sandia.gov> wrote:
Hello,
I have a problem similar to https://sourceforge.net/p/robinhood/mailman/message/35883907/ in which the robinhood server running mariadb-5.5.52-1.el7.x86_64 and the lustre 2.8.0.8 client will reboot when the initial scan is run. I am running this in a testbed environment prior to deployment on our production system because I want to get a complete handle on it before I commit to the deployment. I have 2 separate lustre file systems that I am running against: one is a 408TB lustre 2.8 file system with ~16M inodes, the other is a 204TB lustre 2.5.5 file system with ~3M inodes.
The curious thing is that I had successfully scanned both file systems independently on the system with everything working (including web-gui) and then basically blew away the databases to get a datapoint on how the system performed and the time it took if I ran a scan on both file systems simultaneously. It appears that it is only impacting the 2.8 file system database. I just ran a fresh scan against the 2.5.5 file system without problem. I then started a new scan against the 2.8 file system and once again it rebooted.
Like the other support ticket above, when I ran the scan only on the 2.8 file system in debug mode it also reported messages similar to “2017/07/10 15:44:58 [15191/6] FS_Scan | openat failed on <parent_fd=18>/libippch.so: Too many levels of symbolic links”. I checked a large number of the files that were being reported and for the most part they were library files with only a couple of symlinks to the .so file in the same directory.
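
For context on that message: on Linux, opening a path whose final component is a symbolic link with O_NOFOLLOW fails with ELOOP, whose error text is "Too many levels of symbolic links". The strace output earlier in this thread shows openat being called with O_NOFOLLOW, so these debug messages may simply be the scanner meeting symlinks rather than a real symlink loop. The sketch below reproduces the behaviour in a temporary directory; the file names are made up and it says nothing about the cause of the reboot.

```python
import errno
import os
import tempfile


def open_nofollow_errno(path):
    """Open path read-only with O_NOFOLLOW; return 0 on success, else the errno."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    except OSError as exc:
        return exc.errno
    os.close(fd)
    return 0


def demo():
    """Return (errno for a regular file, errno for a symlink pointing to it)."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "libexample.so.1")  # hypothetical file name
        link = os.path.join(tmp, "libexample.so")      # hypothetical symlink name
        with open(target, "w") as handle:
            handle.write("x")
        os.symlink("libexample.so.1", link)
        return open_nofollow_errno(target), open_nofollow_errno(link)


if __name__ == "__main__":
    regular, symlink = demo()
    print("regular file errno:", regular)
    print("symlink errno:", symlink, errno.errorcode.get(symlink, ""))
```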
The only other thing of note that I was able to capture is this from the console output:
[ 3301.937577] LustreError: 15209:0:(linux-module.c:92:obd_ioctl_getdata()) Version mismatch kernel (10004) vs application (0)
[ 3301.950059] LustreError: 15209:0:(class_obd.c:230:class_handle_ioctl()) OBD ioctl: data error
There was no indication of a fault in any of the log files. I was running top and htop during the process and neither CPU nor memory was exhausted. Nor did I see anything suspicious happening on the file system itself.
Any help or clues as to why this is failing would be greatly appreciated. Thanks in advance.
====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support