-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Randy,
As long as you're working on IB patches, I just remembered that I had
to apply a patch before I could get 2.8.7 to build on my CentOS 5
machines running their stock IB stack.

- --- src/io/bmi/bmi_ib/openib.c.orig     2013-01-10 15:47:52.000000000 -0700
+++ src/io/bmi/bmi_ib/openib.c  2013-01-10 15:37:59.000000000 -0700
@@ -745,7 +745,9 @@
 #ifdef HAVE_IBV_EVENT_CLIENT_REREGISTER
        CASE(IBV_EVENT_CLIENT_REREGISTER);
 #endif
+#ifdef HAVE_IBV_EVENT_GID_CHANGE
        CASE(IBV_EVENT_GID_CHANGE);
+#endif
     }
     return s;
 }
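
To illustrate why the guard is enough, here is a minimal, self-contained
sketch of the pattern. It does not use the real verbs headers: the demo_*
enum, the DEMO_NEW_STACK switch, and the HAVE_DEMO_EVENT_GID_CHANGE macro
are hypothetical stand-ins, and the CASE macro is only my assumption of how
a name-lookup helper like this is usually written.

```c
/* Hypothetical stand-in for the event enum from the IB headers.
 * An "old" stack (DEMO_NEW_STACK undefined) lacks the GID_CHANGE
 * enumerator entirely, which is what breaks an unguarded case. */
enum demo_event_type {
    DEMO_EVENT_PORT_ACTIVE,
    DEMO_EVENT_CLIENT_REREGISTER,
#ifdef DEMO_NEW_STACK
    DEMO_EVENT_GID_CHANGE,
#endif
};

/* Assumption: a configure-style check defines a HAVE_ macro only when
 * the installed headers provide the matching enumerator. */
#define HAVE_DEMO_EVENT_CLIENT_REREGISTER 1
#ifdef DEMO_NEW_STACK
#define HAVE_DEMO_EVENT_GID_CHANGE 1
#endif

/* Map an enumerator to its own name via preprocessor stringification. */
#define CASE(e) case e: s = #e; break

const char *demo_event_name(enum demo_event_type ev)
{
    const char *s = "(unknown event)";

    switch (ev) {
        CASE(DEMO_EVENT_PORT_ACTIVE);
#ifdef HAVE_DEMO_EVENT_CLIENT_REREGISTER
        CASE(DEMO_EVENT_CLIENT_REREGISTER);
#endif
#ifdef HAVE_DEMO_EVENT_GID_CHANGE
        CASE(DEMO_EVENT_GID_CHANGE);
#endif
    }
    return s;
}
```

Built without -DDEMO_NEW_STACK, the enumerator does not exist and the
guarded case is skipped, so the file still compiles. One caveat: if the
build system never defines the HAVE_ macro, the case is silently dropped
on newer stacks too, and such events fall through to the default string.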

The issue was brought up in a thread on this list last summer, but I
never saw a final resolution, and if there was one, it apparently didn't
make it into 2.8.7.

Thanks,
Mike Robbert
Colorado School of Mines

On 2/7/13 8:20 AM, Randall Martin wrote:
> I'm working on a set of patches for the IB support.  There are
> several issues I'm working through on the patches before I commit
> them.  I'll send you a copy when I have them ready for release so
> you can test them.
> 
> 
> -Randy
> 
> 
> On 2/7/13 8:54 AM, "Yves Revaz" <[email protected]> wrote:
> 
>> On 10/18/2012 11:41 PM, Kyle Schochenmaier wrote:
>>> Hi Yves -
>>> 
>>> How frequently do you see these warnings?  Does it cause any 
>>> servers/clients to hang?
>> 
>> Hi Kyle and the list,
>> 
>> In a previous mail, I was mentioning the following errors:
>> 
>> [E 02/07/2013 14:39:24] Warning: encourage_recv_incoming: mop_id d0e680 in RTS_DONE message not found.
>> [E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 17549115350.
>> [E 02/07/2013 14:39:54] fp_multiqueue_cancel: flow proto cancel called on 0x1bce5e0
>> [E 02/07/2013 14:39:54] fp_multiqueue_cancel: I/O error occurred
>> [E 02/07/2013 14:39:54] handle_io_error: flow proto error cleanup started on 0x1bce5e0: Operation cancelled (possibly due to timeout)
>> [E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 canceled 1 operations, will clean up.
>> [E 02/07/2013 14:39:54] bmi_recv_callback_fn: I/O error occurred
>> [E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 error cleanup finished: Operation cancelled (possibly due to timeout)
>> 
>> In fact, I'm trying to move 10 TB of data in our pvfs using
>> rsync. When a lot of data is transferred, those errors occur
>> very frequently, about every 5 minutes, which is very annoying.
>> 
>> I've checked our IB network, which is perfectly sane. I'm
>> currently using orangefs-2.8.6. Should I move to 2.8.7? Looking
>> at the changelog of the 2.8.7 release, I don't think IB-related
>> problems have been fixed.
>> 
>> Thanks,
>> 
>> yves
>> 
>>> If this is not common or destructive, it could simply be that an
>>> error occurred on the InfiniBand fabric and the operation
>>> timed out in pvfs; that can be readily ignored, as it would
>>> be retransmitted eventually.
>>> 
>>> If you see this a lot, it may be one of a few issues that we've
>>> fixed in recent releases. Which version of orangefs/pvfs are
>>> you using?
>>> ~Kyle
>>> 
>>> Kyle Schochenmaier
>>> 
>>> 
>>> On Thu, Oct 18, 2012 at 4:31 PM, Becky
>>> Ligon<[email protected]>  wrote:
>>>> Yves:
>>>> 
>>>> The timeouts that you listed below are in the configuration
>>>> file.
>>>> 
>>>> ClientJobBMITimeoutSecs 300 - The client's job scheduler
>>>> limits each "job" sent across the network to this timeout.
>>>> If the job exceeds this limit, the job is cancelled.
>>>> Depending on the request, the job may be retried. Keep in
>>>> mind that one PVFS request can be made up of many jobs.
>>>> 
>>>> ClientJobFlowTimeoutSecs - This value limits the time spent
>>>> on a particular job called a flow.  A flow is used to
>>>> transfer data across the network to a server or to transfer
>>>> data from a server to the client.    Again, if the flow
>>>> exceeds this timeout, then the flow is cancelled.
>>>> 
>>>> The server counterparts for these settings are rarely used,
>>>> since the server doesn't normally initiate reads or writes.
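
The timeout behaviour described above can be pictured with a small
sketch. This is not OrangeFS code: struct demo_job and demo_expire_jobs
are hypothetical names, and the only thing taken from the thread is the
idea that a job older than its configured limit (for example the
ClientJobFlowTimeoutSecs value) gets cancelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job record; the fields are illustrative only. */
struct demo_job {
    long id;
    long start_secs;    /* time at which the job was posted */
    long timeout_secs;  /* limit taken from the config, e.g. 300 */
    bool done;          /* job finished normally */
    bool cancelled;     /* job was cancelled by the expiry pass */
};

/* Cancel every unfinished job whose age exceeds its timeout.
 * Returns how many jobs this pass cancelled. */
size_t demo_expire_jobs(struct demo_job *jobs, size_t n, long now_secs)
{
    size_t cancelled = 0;

    for (size_t i = 0; i < n; i++) {
        struct demo_job *j = &jobs[i];

        if (j->done || j->cancelled)
            continue;   /* nothing left to expire */
        if (now_secs - j->start_secs > j->timeout_secs) {
            j->cancelled = true;
            cancelled++;
        }
    }
    return cancelled;
}
```

In this picture, raising the timeout only delays the cancellation; a
flow that is stalled for another reason would still be cancelled once
the larger limit passes.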
>>>> 
>>>> I think your real problem has something to do with IB, but I
>>>> am not an expert in that area.  I have cc'd Kyle
>>>> Schochenmaier to see if he can help.
>>>> 
>>>> Becky
>>>> 
>>>> 
>>>> 
>>>> On Thu, Oct 18, 2012 at 4:07 PM, Yves
>>>> Revaz<[email protected]>  wrote:
>>>>> 
>>>>> Dear list,
>>>>> 
>>>>> I sometimes see the following error occurring in my pvfs
>>>>> server log.
>>>>> 
>>>>> [E 10/18/2012 20:59:50] Warning: encourage_recv_incoming: mop_id 150c320 in RTS_DONE message not found.
>>>>> [E 10/18/2012 21:00:50] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 33307291.
>>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: flow proto cancel called on 0xf18c80
>>>>> [E 10/18/2012 21:00:50] fp_multiqueue_cancel: I/O error occurred
>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto error cleanup started on 0xf18c80: Operation cancelled (possibly due to timeout)
>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 canceled 1 operations, will clean up.
>>>>> [E 10/18/2012 21:00:50] bmi_recv_callback_fn: I/O error occurred
>>>>> [E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 error cleanup finished: Operation cancelled (possibly due to time
>>>>> 
>>>>> 
>>>>> Looking at the mailing list, I've found the suggestion to
>>>>> increase these default values (300)
>>>>> 
>>>>> ServerJobBMITimeoutSecs 30
>>>>> ServerJobFlowTimeoutSecs 30
>>>>> ClientJobBMITimeoutSecs 300
>>>>> ClientJobFlowTimeoutSecs 300
>>>>> 
>>>>> to 600.
>>>>> 
>>>>> What is the origin of these timeouts?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> 
>>>>> yves
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>>                                                 (o o)
>>>>> --------------------------------------------oOO--(_)--OOo-------
>>>>> Dr. Yves Revaz
>>>>> Laboratory of Astrophysics EPFL
>>>>> Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
>>>>> 51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
>>>>> 1290 Sauverny             e-mail : [email protected]
>>>>> SWITZERLAND                  Web : http://www.lunix.ch/revaz/
>>>>> ----------------------------------------------------------------
>>>>> 
>>>>> _______________________________________________
>>>>> Pvfs2-users mailing list
>>>>> [email protected]
>>>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>>
>>>> --
>>>> Becky Ligon
>>>> OrangeFS Support and Development
>>>> Omnibond Systems
>>>> Anderson, South Carolina
>>>> 
>> 
>> 
>> --
>> 
>> ----------------------------------------------------------------
>> Dr. Yves Revaz
>> Laboratory of Astrophysics
>> Ecole Polytechnique Fédérale de Lausanne (EPFL)
>> Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
>> 51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
>> 1290 Sauverny             e-mail : [email protected]
>> SWITZERLAND                  Web : http://www.lunix.ch/revaz/
>> ----------------------------------------------------------------
>> 
> 
> 
> 
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJRE9VcAAoJEFmgPOBxQDtBEYMIAJtgo1LMWxVtyWPa2PNvWr2c
NMUw30GNJ2llhwJVdefpmNqPLdou0Sqr7moAPseA2qYBguER1jqSH0rnXg7yE5TX
CNERJwaL4+99y+tRsvKukrEvegrS/CQ5tUPsiuFaqqcTlQRGYeGPtqJV3JuAsEa2
bu49sN7yWFtM2fY0ZaFa2ouya6PR2mFAdH0ZnpcWr4OTY1Uf4py8njWvvWrMCB/2
I3//H5RoOxhCBIe85RCdXbMh4LMQbwBeTYFePlutE7YplbrQwDLg/K4/ctswRl3T
oKpRy5GJ83LJQomhwWWjAAnWWXe6zNlbiGe/B5APrlgZfV960shxFPeWwej3EEk=
=iXn7
-----END PGP SIGNATURE-----