[EMAIL PROTECTED] wrote on Thu, 13 Mar 2008 13:14 -0400:
> By tracing the call down through libibverbs, I discovered that
> the call path went into some sort of compatibility layer in
> OFED and ended up hitting __ibv_create_cq_1_0 (in
> src/userspace/libibverbs/compat-1_0.c) instead of the
> __ibv_create_cq function in src/userspace/libibverbs/src/verbs.c.
> I was able to "correct" this by forcibly linking libpvfs2.so
> with libibverbs, so there's likely some symbol versioning black
> magic going on here. Perhaps Florin could confirm this to be
> his problem as well.
I was just debugging this exact issue in the context of another
project and app and discovered that it is necessary to add -libverbs
to the "gcc -shared" command when building libpvfs2.so. I'm still a
bit fuzzy on exactly why. Checked in a patch to mainline just now,
with this message:
shared lib deps
For proper symbol versioning, at shared library build time, it is necessary
to specify all the other shared libraries that will be required to run
the one we are about to create. The particular place this shows up as a
problem is for libibverbs.so when building bmi ib. If you don't tell the
linker that you'll need libibverbs.so later, at runtime, it will pick up
the 1.0 rather than the 1.1 symbols.
> So, I get it up and running (VFS interface and all), and
> quickly hit a failure trying to do a basic
> "dd if=/dev/zero of=testfile bs=4M count=10000" on my pvfs2
> mount. Smaller tests seem to work ok (e.g. count=100).
> Attached is the pvfs-client.log output - and before I try to get
> our IB guys involved I wanted to see if anything jumped out to
> the PVFS2 developer community (perhaps BMI related?) or if I
> could get some help debugging it further.
> [D 15:46:56.524470] [INFO]: Mapping pointer 0x2b09efdf4000 for I/O.
> [D 15:46:56.550917] [INFO]: Mapping pointer 0x2b09f11f6000 for I/O.
> [E 15:49:41.383093] fp_multiqueue_cancel: flow proto cancel called on 0x5acfe8
> [E 15:49:41.383136] handle_io_error: flow proto error cleanup started on
> 0x5acfe8, error_code: -1610613121
Something timed out. Check the server logs and see if they have anything
to say. Once an operation is cancelled, recovery is messy. Would be good
to understand why this cancellation happened.
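
For reference, the error_code in that log line can be split into bit fields
with plain shell arithmetic. The hex value and the bit positions below are
just arithmetic. The meaning given to each field is an assumption based on
the error encoding in pvfs2-types.h as best recalled (an error bit, a
non-errno bit, a 3-bit class field starting at bit 7, and a low error
number), so check it against the header in your tree before relying on it.

```shell
# Magnitude of the error_code reported by handle_io_error (-1610613121).
code=1610613121

printf 'hex value:    0x%x\n' "$code"
printf 'bit 30:       %d\n' $(( (code >> 30) & 1 ))   # assumed: PVFS error bit
printf 'bit 29:       %d\n' $(( (code >> 29) & 1 ))   # assumed: non-errno error bit
printf 'low bits:     0x%x\n' $(( code & 0x1fffffff ))

# Assumed layout of the low bits: class in bits 7-9, error number below that.
printf 'class field:  %d\n' $(( (code >> 7) & 7 ))    # assumed: 3 would be the flow class
printf 'error number: %d\n' $(( code & 0x7f ))        # assumed: 1 would be a cancel code
```

If those assumptions hold, the value reads as a cancellation reported by the
flow layer, which agrees with the fp_multiqueue_cancel message just before
it in the log.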
> [E 15:49:42.697884] Error: ib_check_cq: unknown send state SQ_CANCELLED (10)
> of sq 0x552050.
Fixed this in mainline. Won't help to understand why your operation was
cancelled, though.
> I've got 6 I/O servers and 1 metadata server, pvfs2 storage
> is on SRP-based LUNs on a DDN array, fs.conf is also attached.
>
> Oh, and my pvfs2 configure options:
>
> ./configure --prefix=/afs/ld/software/sys \
> --with-openib=/usr \
> --with-openib-libs=/usr/lib64 \
> --with-kernel=/usr/src/linux-2.6.16.54-0.2.3 \
> --enable-shared --enable-trusted-connections \
> --enable-mmap-racache --without-bmi-tcp
I think you don't need "--with-openib-libs". And trusted
connections only apply to BMI-TCP currently. An interested party
could add this pretty quickly though.
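
A trimmed invocation along those lines might look like the sketch below. It
only drops the two options mentioned above and keeps every other flag from
the original command. Whether configure finds the libraries without
--with-openib-libs on this system is an assumption to verify, not something
confirmed here.

```shell
# Run from the top of the pvfs2 source tree.
if [ -x ./configure ]; then
    # Same options as before, minus --with-openib-libs and
    # --enable-trusted-connections.
    ./configure --prefix=/afs/ld/software/sys \
        --with-openib=/usr \
        --with-kernel=/usr/src/linux-2.6.16.54-0.2.3 \
        --enable-shared \
        --enable-mmap-racache --without-bmi-tcp
else
    echo "configure not found; run this from the pvfs2 source tree" >&2
fi
```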
-- Pete