On Wed, Mar 21, 2007 at 04:00:40PM -0700, Jan Lindheim wrote:
> We have found that when trying to use pvfs with romio under openmpi,
> we are getting errors when the task count is bigger than 128, using
> 1MB messages.  Smaller message sizes and larger task counts also cause
> the same error to be generated, just not as consistently or quickly.
> Errors that we see look like:
>
> [E 15:05:50.012128] job_time_mgr_expire: job time out: cancelling bmi operation, job_id: 34.
> [E 15:05:50.012380] msgpair failed, will retry: Operation cancelled (possibly due to timeout)

Just want to understand your workload a bit: you are doing a collective
write with 128 processes, each writing 1MB, right?

> Writing to an NFS mounted file system instead of PVFS, works fine even
> with 256 tasks.
> Our version of PVFS is 2.6.2.  Both openmpi 1.1.x and 1.2 produce the
> same errors.  Any known limitations with romio and PVFS?
> We can supply you with a test code if you are interested in reproducing
> the problem.  The code should compile well with mpich as well as
> openmpi.

Go ahead and send the test code, but it really looks like you are
pushing the servers hard and hitting a timeout.  How many servers do
you have for this many clients?

PVFS should be smarter about such a situation, but could you check
something for us?  In your fs.conf, what is the value of
ServerJobBMITimeoutSecs?

http://www.pvfs.org/pvfs2-options.html#ServerJobBMITimeoutSecs

If you increase that value to, say, 3600, we can ensure the timeouts
won't get triggered.  I have a few other ideas, but let's try this one
first.

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
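
For the fs.conf check requested above, a minimal sketch of how the current value could be read out. The helper name read_timeout and the sample config text are hypothetical and only for illustration. It assumes the option appears as a plain "name value" line, and the 30 in the sample is made up, not taken from the poster's file.

```python
def read_timeout(conf_text, key="ServerJobBMITimeoutSecs"):
    """Return the integer set for `key` in fs.conf-style text, or None if absent."""
    for line in conf_text.splitlines():
        stripped = line.strip()
        # Skip comment lines.
        if stripped.startswith("#"):
            continue
        parts = stripped.split()
        # Assumption: the option is written as "<name> <integer>" on one line.
        if len(parts) == 2 and parts[0] == key:
            return int(parts[1])
    return None


# Hypothetical sample text; a real fs.conf contains many more options.
sample = """
<Defaults>
    ServerJobBMITimeoutSecs 30
    ServerJobFlowTimeoutSecs 30
</Defaults>
"""

if __name__ == "__main__":
    print(read_timeout(sample))
```

To run it against a real file, pass the file's contents, for example read_timeout(open(path).read()), where path is wherever fs.conf lives on your servers. After changing the value to 3600 as suggested, the same call should report the new number.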
