On 08/22/17 17:17, Mervini, Joseph A wrote:
Hello,
I have not received a response to this posting but have continued to
try to figure out why this problem persists.
Since I initially opened the request I have been able to duplicate it
on three different machines. I have also tried multiple kernel
versions and lustre 2.8 client versions, and I have completely
rebuilt my lustre file system with a newer version of 2.8 (2.8.0.9).
The problem only presents itself when running a scan against the 2.8
lustre file system; a 2.5 lustre file system works fine.
Hello,
Do you get a kernel crash or a robinhood process crash ?
I decided to run simultaneous scans against both file systems under
valgrind, and although (again) the scan of the 2.5 file system
completed, the system rebooted before the 2.8 scan finished.
However, in both scans valgrind’s output was similar, with entries like this:
==8883== Thread 7:
==8883== Conditional jump or move depends on uninitialised value(s)
==8883== at 0x44B8E8: batch_insert_stripe_info (listmgr_stripe.c:168)
==8883== by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
==8883== by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
==8883== by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
==8883== by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
==8883== by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==8883== by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
==8883==
==8883== Conditional jump or move depends on uninitialised value(s)
==8883== at 0x44B9C5: batch_insert_stripe_info (listmgr_stripe.c:168)
==8883== by 0x44B1F1: listmgr_batch_insert_no_tx (listmgr_insert.c:246)
==8883== by 0x44B35D: ListMgr_Insert (listmgr_insert.c:274)
==8883== by 0x4199E4: EntryProc_db_apply (std_pipeline.c:1616)
==8883== by 0x415B5D: entry_proc_worker_thr (entry_proc_impl.c:145)
==8883== by 0x60B2DC4: start_thread (in /usr/lib64/libpthread-2.17.so)
==8883== by 0x6B0D76C: clone (in /usr/lib64/libc-2.17.so)
==8883==
==8883==
==8883== More than 10000000 total errors detected. I'm not reporting
any more.
==8883== Final error counts will be inaccurate. Go fix your program!
==8883== Rerun with --error-limit=no to disable this cutoff. Note
==8883== that errors may occur in your program without prior warning from
==8883== Valgrind, because errors are no longer being displayed.
==8883==
====
Is this considered normal?
Yes, actually valgrind is confused by the fact that the striping
information is filled in by an ioctl() call (equivalent to getstripe),
and it has no idea what area of memory is set by this ioctl.
Valgrind then propagates the warning to every decision based on stripe info.
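If the flood of warnings makes the rest of the valgrind output unusable, one option is a suppressions file targeting the stripe-handling frames. A minimal sketch, using the function names from the backtraces above (the suppression name itself is arbitrary):

```
# robinhood-stripe.supp -- silence uninitialised-value ("Cond") warnings
# originating in the stripe info filled in by the lustre ioctl.
{
   robinhood_stripe_ioctl
   Memcheck:Cond
   fun:batch_insert_stripe_info
}
```

Run with `valgrind --suppressions=robinhood-stripe.supp ...` so only the remaining, presumably genuine, errors are reported.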
Thomas
Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov <mailto:jame...@sandia.gov>
On Jul 11, 2017, at 10:17 AM, Mervini, Joseph A <jame...@sandia.gov
<mailto:jame...@sandia.gov>> wrote:
Hello,
I have a problem similar to
https://sourceforge.net/p/robinhood/mailman/message/35883907/ in
which the robinhood server, running mariadb-5.5.52-1.el7.x86_64 and a
lustre 2.8.0.8 client, reboots when the initial scan is run. I am
running this in a testbed environment prior to deployment on our
production system because I want to get a complete handle on it
before I commit to the deployment. I have 2 separate lustre file
systems that I am running against: one is a 408TB lustre 2.8 file
system with ~16M inodes, the other is a 204TB lustre 2.5.5 file
system with ~3M inodes.
The curious thing is that I had successfully scanned both file
systems independently on the system with everything working
(including the web GUI) and then basically blew away the databases to
get a data point on how the system performed and how long it took if
I ran a scan on both file systems simultaneously. It appears that it
is only impacting the 2.8 file system database. I just ran a fresh
scan against the 2.5.5 file system without a problem. I then started
a new scan against the 2.8 file system and once again it rebooted.
Like the other support ticket above, when I ran the scan only on the
2.8 file system in debug mode it also reported messages similar to
“2017/07/10 15:44:58 [15191/6] FS_Scan | openat failed on
<parent_fd=18>/libippch.so: Too many levels of symbolic links”. I
checked a large number of the files that were being reported and for
the most part they were library files with only a couple of symlinks
to the .so file in the same directory.
The only other thing of note that I was able to capture is this from
the console output:
[ 3301.937577] LustreError:
15209:0:(linux-module.c:92:obd_ioctl_getdata()) Version mismatch
kernel (10004) vs application (0)
[ 3301.950059] LustreError:
15209:0:(class_obd.c:230:class_handle_ioctl()) OBD ioctl: data error"
There was no indication of a fault in any of the log files, and I was
running top and htop during the process: neither CPU nor memory was
exhausted. Nor did I see anything suspicious happening on the file
system itself.
Any help or clues as to why this is failing would be greatly
appreciated. Thanks in advance.
====
Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov <mailto:jame...@sandia.gov>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org <http://Slashdot.org>!
http://sdm.link/slashdot
_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
<mailto:robinhood-support@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/robinhood-support