On 10/18/2012 11:41 PM, Kyle Schochenmaier wrote:
Hi Yves -

How frequently do you see these warnings?  Does it cause any
servers/clients to hang?

Hi Kyle and the list,

In a previous mail, I mentioned the following errors:

[E 02/07/2013 14:39:24] Warning: encourage_recv_incoming: mop_id d0e680 in RTS_DONE message not found.
[E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 17549115350.
[E 02/07/2013 14:39:54] fp_multiqueue_cancel: flow proto cancel called on 0x1bce5e0
[E 02/07/2013 14:39:54] fp_multiqueue_cancel: I/O error occurred
[E 02/07/2013 14:39:54] handle_io_error: flow proto error cleanup started on 0x1bce5e0: Operation cancelled (possibly due to timeout)
[E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 canceled 1 operations, will clean up.
[E 02/07/2013 14:39:54] bmi_recv_callback_fn: I/O error occurred
[E 02/07/2013 14:39:54] handle_io_error: flow proto 0x1bce5e0 error cleanup finished: Operation cancelled (possibly due to timeout)

In fact, I'm trying to move 10 TB of data in our pvfs using rsync.
When a lot of data is transferred, these errors occur very frequently, about every 5 minutes, which
is very annoying.
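
To put a number on "about every 5 minutes", the gaps between job timeouts can be measured directly from the server log. The following is a minimal Python sketch, assuming the log keeps the "[E MM/DD/YYYY HH:MM:SS]" prefix seen in the excerpt above. The function names timeout_times and gaps_in_seconds are illustrative only and are not part of PVFS or OrangeFS.

```python
import re
from datetime import datetime

# Timestamp layout assumed from the log excerpt: month/day/year, 24-hour time.
STAMP_FORMAT = "%m/%d/%Y %H:%M:%S"

# Finds every job_time_mgr_expire entry and captures its timestamp.
TIMEOUT_ENTRY = re.compile(
    r"\[E (\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\] job_time_mgr_expire:"
)


def timeout_times(log_text):
    """Return the timestamps of all job timeout entries, in log order."""
    return [
        datetime.strptime(match.group(1), STAMP_FORMAT)
        for match in TIMEOUT_ENTRY.finditer(log_text)
    ]


def gaps_in_seconds(times):
    """Return the seconds elapsed between consecutive timeouts."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Small sample built from the entries quoted in this thread.
sample_log = (
    "[E 02/07/2013 14:39:24] Warning: encourage_recv_incoming: "
    "mop_id d0e680 in RTS_DONE message not found.\n"
    "[E 02/07/2013 14:39:54] job_time_mgr_expire: job time out: "
    "cancelling flow operation, job_id: 17549115350.\n"
)
print(len(timeout_times(sample_log)), "job timeout(s) found")
```

Running timeout_times on the full text of a server log and passing the result to gaps_in_seconds gives the spacing between timeouts, which can then be compared with the configured timeout values.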

I've checked our IB network, which is perfectly sane.
I'm currently using orangefs-2.8.6. Should I move to 2.8.7?
Looking at the changelog of the 2.8.7 release, I don't think any IB-related problems
have been fixed.

Thanks,

yves














If this is not common or destructive, it could be that a simple error
occurred on the InfiniBand fabric and the operation timed out in pvfs.
That can be readily ignored, as the operation would be retransmitted
eventually.

If you see this a lot, it may be one of a few issues that we've fixed
in recent releases. Which version of orangefs/pvfs are you using?
~Kyle

Kyle Schochenmaier


On Thu, Oct 18, 2012 at 4:31 PM, Becky Ligon <[email protected]> wrote:
Yves:

The timeouts that you listed below are in the configuration file.

ClientJobBMITimeoutSecs 300 - The client's job scheduler limits each "job"
sent across the network to this timeout.  If the job exceeds this limit, the
job is cancelled.  Depending on the request, the job may be retried.  Keep
in mind that one PVFS request can be made up of many jobs.

ClientJobFlowTimeoutSecs - This value limits the time spent on a particular
job called a flow.  A flow is used to transfer data across the network to a
server or to transfer data from a server to the client.    Again, if the
flow exceeds this timeout, then the flow is cancelled.

The server counterparts for these settings are rarely used, since the server
doesn't normally initiate reads or writes.
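
As a hedged illustration of the settings described above, the short Python sketch below pulls the four timeout directives out of a configuration file's text so the values actually in place can be checked. It assumes each directive sits on its own line in the form "Name value"; the helper name read_timeouts is hypothetical and is not an OrangeFS tool.

```python
# The four timeout directives discussed in this thread.
TIMEOUT_KEYS = (
    "ServerJobBMITimeoutSecs",
    "ServerJobFlowTimeoutSecs",
    "ClientJobBMITimeoutSecs",
    "ClientJobFlowTimeoutSecs",
)


def read_timeouts(config_text):
    """Return {directive: seconds} for each timeout directive found.

    Assumes one "Name value" pair per line; anything else is skipped.
    """
    found = {}
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in TIMEOUT_KEYS and parts[1].isdigit():
            found[parts[0]] = int(parts[1])
    return found


# Sample using the values listed in the original message.
sample_config = """
    ServerJobBMITimeoutSecs 30
    ServerJobFlowTimeoutSecs 30
    ClientJobBMITimeoutSecs 300
    ClientJobFlowTimeoutSecs 300
"""
for name, seconds in read_timeouts(sample_config).items():
    print(name, seconds)
```

Comparing these values with the spacing of the timeout entries in the server log shows whether a flow is hitting the limit itself or failing earlier.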

I think your real problem has something to do with IB, but I am not an
expert in that area.  I have cc'd Kyle Schochenmaier to see if he can help.

Becky



On Thu, Oct 18, 2012 at 4:07 PM, Yves Revaz <[email protected]> wrote:

Dear list,

I sometimes see the following errors occurring in my pvfs server log.

[E 10/18/2012 20:59:50] Warning: encourage_recv_incoming: mop_id 150c320
in RTS_DONE message not found.
[E 10/18/2012 21:00:50] job_time_mgr_expire: job time out: cancelling flow
operation, job_id: 33307291.
[E 10/18/2012 21:00:50] fp_multiqueue_cancel: flow proto cancel called on
0xf18c80
[E 10/18/2012 21:00:50] fp_multiqueue_cancel: I/O error occurred
[E 10/18/2012 21:00:50] handle_io_error: flow proto error cleanup started
on 0xf18c80: Operation cancelled (possibly due to timeout)
[E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 canceled 1
operations, will clean up.
[E 10/18/2012 21:00:50] bmi_recv_callback_fn: I/O error occurred
[E 10/18/2012 21:00:50] handle_io_error: flow proto 0xf18c80 error cleanup
finished: Operation cancelled (possibly due to time


Looking at the mailing list, I've found suggestions to increase these default
values (300)

         ServerJobBMITimeoutSecs 30
         ServerJobFlowTimeoutSecs 30
         ClientJobBMITimeoutSecs 300
         ClientJobFlowTimeoutSecs 300

to 600.

What is the origin of these timeouts?

Thanks,


yves





--
                                                  (o o)
--------------------------------------------oOO--(_)--OOo-------
   Dr. Yves Revaz
   Laboratory of Astrophysics EPFL

   Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
   51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
   1290 Sauverny             e-mail : [email protected]
   SWITZERLAND                  Web : http://www.lunix.ch/revaz/
----------------------------------------------------------------

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users



--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina




--

----------------------------------------------------------------
  Dr. Yves Revaz
  Laboratory of Astrophysics
  Ecole Polytechnique Fédérale de Lausanne (EPFL)
  Observatoire de Sauverny     Tel : ++ 41 22 379 24 28
  51. Ch. des Maillettes       Fax : ++ 41 22 379 22 05
  1290 Sauverny             e-mail : [email protected]
  SWITZERLAND                  Web : http://www.lunix.ch/revaz/
----------------------------------------------------------------
