Randy,
As long as you're working on IB patches, I just remembered that I had
to apply a patch before I could get 2.8.7 to build on my CentOS 5
machines running their stock IB stack.
--- src/io/bmi/bmi_ib/openib.c.orig 2013-01-10 15:47:52.000000000 -0700
+++ src/io/bmi/bmi_ib/openib.c 2013-01-10 15:37:59.000000000 -0700
@@ -745,7 +745,9 @@
 #ifdef HAVE_IBV_EVENT_CLIENT_REREGISTER
 CASE(IBV_EVENT_CLIENT_REREGISTER);
 #endif
+#ifdef HAVE_IBV_EVENT_GID_CHANGE
 CASE(IBV_EVENT_GID_CHANGE);
+#endif
 }
 return s;
 }
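
For context, the code that hunk touches just stringifies verbs async
events, roughly along the lines of the sketch below. The helper name and
CASE macro here are my own illustration, not copied from openib.c; the
point is that without the guard, building against a libibverbs whose
headers predate IBV_EVENT_GID_CHANGE fails because the enumerator is
undeclared.

#include <infiniband/verbs.h>

/* Illustrative only: map an async event type to a printable name. */
#define CASE(e) case e: s = #e; break

static const char *async_event_name(enum ibv_event_type event)
{
    const char *s = "(unknown event)";
    switch (event) {
        CASE(IBV_EVENT_PORT_ACTIVE);
#ifdef HAVE_IBV_EVENT_GID_CHANGE
        /* Only compiled when the build found the enumerator. */
        CASE(IBV_EVENT_GID_CHANGE);
#endif
        default: break;
    }
    return s;
}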
The issue was brought up in a thread on this list last summer, but I
never saw a final resolution, and if there was one it apparently didn't
make it into 2.8.7.
Thanks,
Mike Robbert
Colorado School of Mines
On 2/7/13 8:20 AM, Randall Martin wrote:
> I'm working on a set of patches for the IB support. There are
> several issues I'm working through on the patches before I commit
> them. I'll send you a copy when I have them ready for release so
> you can test them.
>
>
> -Randy
>
>
> On 2/7/13 8:54 AM, "Yves Revaz" <[email protected]> wrote:
>
>> On 10/18/2012 11:41 PM, Kyle Schochenmaier wrote:
>>> Hi Yves -
>>>
>>> How frequently do you see these warnings? Do they cause any
>>> servers/clients to hang?
>>
>> Hi Kyle and the list,
>>
>> In a previous mail, I mentioned the following errors:
>>
>> [E 02/07/2013 14:39:24] Warning: encourage_recv_incoming: mop_id d0e680 in RTS_DONE message not found.
>> [E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 17549115350.
>> [E 02/07/2013 14:39:54] fp_multiqueue_cancel: flow proto cancel called on 0x1bce5e0
>> [E 02/07/2013 14:39:54] fp_multiqueue_cancel: I/O error occurred
>> [E 02/07/2013 14:39:54] handle_io_error: flow proto error cleanup started on 0x1bce5e0: Operation cancelled (possibly due to timeout)
>> [E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 canceled 1 operations, will clean up.
>> [E 02/07/2013 14:39:54] bmi_recv_callback_fn: I/O error occurred
>> [E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 error cleanup finished: Operation cancelled (possibly due to timeout)
>>
>> In fact, I'm trying to move 10 TB of data into our pvfs using
>> rsync. When a lot of data is transferred, these errors occur
>> very frequently, about every 5 minutes, which is very annoying.
>>
>> I've checked our IB network, which is perfectly sane. I'm
>> currently using orangefs-2.8.6. Should I move to 2.8.7? Looking
>> at the changelog of the 2.8.7 release, I don't think any IB-related
>> problems have been fixed.
>>
>> Thanks,
>>
>> yves
>>
>>> If it's not common or destructive, it could be that there was a
>>> simple error on the InfiniBand fabric and that the operation
>>> timed out in pvfs; that can be safely ignored, as the operation
>>> would be retransmitted eventually.
>>>
>>> If you see this a lot, it may be one of a few issues that we've
>>> fixed in recent releases. Which version of orangefs/pvfs are
>>> you using? ~Kyle
>>>
>>> Kyle Schochenmaier
>>>
>>>
>>> On Thu, Oct 18, 2012 at 4:31 PM, Becky
>>> Ligon<[email protected]> wrote:
>>>> Yves:
>>>>
>>>> The timeouts that you listed below are in the configuration
>>>> file.
>>>>
>>>> ClientJobBMITimeoutSecs 300 - The client's job scheduler
>>>> limits each "job" sent across the network to this timeout.
>>>> If the job exceeds this limit, the job is cancelled.
>>>> Depending on the request, the job may be retried. Keep in
>>>> mind that one PVFS request can be made up of many jobs.
>>>>
>>>> ClientJobFlowTimeoutSecs - This value limits the time spent
>>>> on a particular job called a flow. A flow is used to
>>>> transfer data across the network to a server or to transfer
>>>> data from a server to the client. Again, if the flow
>>>> exceeds this timeout, then the flow is cancelled.
>>>>
>>>> The server counterparts for these settings are rarely used,
>>>> since the server doesn't normally initiate reads or writes.
>>>>
>>>> I think your real problem has something to do with IB, but I
>>>> am not an expert in that area. I have cc'd Kyle
>>>> Schochenmaier to see if he can help.
>>>>
>>>> Becky
>>>>
>>>>
>>>>
>>>> On Thu, Oct 18, 2012 at 4:07 PM, Yves
>>>> Revaz<[email protected]> wrote:
>>>>>
>>>>> Dear list,
>>>>>
>>>>> I sometimes have the following errors occurring in my pvfs
>>>>> server log.
>>>>>
>>>>> [E 10/18/2012 20:59:50] Warning: encourage_recv_incoming: mop_id 150c320 in RTS_DONE message not found.
>>>>> [E 10/18/2012 21:00:50] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 33307291.
>>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: flow proto cancel called on 0xf18c80
>>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: I/O error occurred
>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto error cleanup started on 0xf18c80: Operation cancelled (possibly due to timeout)
>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 canceled 1 operations, will clean up.
>>>>> [E 10/18/2012 21:00:50] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 error cleanup finished: Operation cancelled (possibly due to timeout)
>>>>>
>>>>>
>>>>> Looking at the mailing list, I've found suggestions to increase
>>>>> these default values
>>>>>
>>>>> ServerJobBMITimeoutSecs 30
>>>>> ServerJobFlowTimeoutSecs 30
>>>>> ClientJobBMITimeoutSecs 300
>>>>> ClientJobFlowTimeoutSecs 300
>>>>>
>>>>> to 600.
>>>>>
>>>>> What is at the origin of these timeouts?
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> yves
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ----------------------------------------------------------------
>>>>> Dr. Yves Revaz
>>>>> Laboratory of Astrophysics, EPFL
>>>>> Observatoire de Sauverny, Ch. des Maillettes, 1290 Sauverny, SWITZERLAND
>>>>> Tel : ++ 41 22 379 24 28 51.
>>>>> Fax : ++ 41 22 379 22 05
>>>>> e-mail : [email protected]
>>>>> Web : http://www.lunix.ch/revaz/
>>>>> ----------------------------------------------------------------
>>>>>
>>>>
>>>> --
>>>> Becky Ligon
>>>> OrangeFS Support and Development
>>>> Omnibond Systems
>>>> Anderson, South Carolina
>>>>
>>>>
>>
>>
>> --
>> ----------------------------------------------------------------
>> Dr. Yves Revaz
>> Laboratory of Astrophysics
>> Ecole Polytechnique Fédérale de Lausanne (EPFL)
>> Observatoire de Sauverny, Ch. des Maillettes, 1290 Sauverny, SWITZERLAND
>> Tel : ++ 41 22 379 24 28 51.
>> Fax : ++ 41 22 379 22 05
>> e-mail : [email protected]
>> Web : http://www.lunix.ch/revaz/
>> ----------------------------------------------------------------
>>
>
>
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users