Scott and Rob,
PAV is the pvfs auto-volume service; it lets me start pvfs for a job on
the compute nodes I've been scheduled. Effectively, it's a remote
configuration tool that takes a config file, then configures and starts
the pvfs servers on a subset of my job's nodes.
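To give a feel for the mechanics, my job script does roughly the
following (a rough sketch; the exact script names and flags are whatever
ships in the pav directory of the pvfs2 tree, and the paths here are
just illustrative):

  # build a per-job config from my scheduled node list and start the servers
  pav_start -c /tmp/bradles-pav/pav.conf

  # ... run the MPI job against the resulting volume ...

  # tear the volume down when the job is done
  pav_stop -c /tmp/bradles-pav/pav.conf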
Additional requested info . . .
MX version:
[brad...@node0394:bradles-pav:1009]$ mx_info
MX Version: 1.2.7
MX Build: w...@node0002:/home/wolf/rpm/BUILD/mx-1.2.7 Wed Dec 3
09:21:26 EST 2008
1 Myrinet board installed.
The MX driver is configured to support a maximum of:
16 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
===================================================================
Instance #0: 313.6 MHz LANai, PCI-E x8, 2 MB SRAM, on NUMA node 0
Status: Running, P0: Link Up
Network: Myrinet 10G
MAC Address: 00:60:dd:47:23:4e
Product code: 10G-PCIE-8A-C
Part number: 09-03327
Serial number: 338892
Mapper: 00:60:dd:47:21:dd, version = 0x00000063, configured
Mapped hosts: 772
Pvfs2 is version 2.7.1, built with MX enabled and TCP disabled. (Note
that the driver above is already configured for 16 endpoints per NIC.)
I can copy files out of the file system, but writing to it is
precarious: the data appears to land on the servers, but the operation
then seems to hang. Here is my job output using mpi-io-test:
time -p mpiexec -n 2 -npernode 1 \
    /home/bradles/software/anl-io-test/bin/anl-io-test-mx \
    -f pvfs2:/tmp/bradles-pav/mount/anl-io-data
# Using mpi-io calls.
[E 12:21:32.047891] job_time_mgr_expire: job time out: cancelling bmi
operation, job_id: 3.
[E 12:21:32.058035] msgpair failed, will retry: Operation cancelled
(possibly due to timeout)
[E 12:26:32.217723] job_time_mgr_expire: job time out: cancelling bmi
operation, job_id: 56.
[E 12:26:32.227774] msgpair failed, will retry: Operation cancelled
(possibly due to timeout)
=>> PBS: job killed: walltime 610 exceeded limit 600
This is writing 32MB into a file. The data all seems to be there (the
file size is 33554432 bytes, i.e. 32 * 2^20), but the write calls
apparently never return. I don't know how to diagnose what's wrong, so
any help is much appreciated.
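To try to narrow this down, my next step is to write through the system
interface instead of MPI-IO, to see whether the hang is in the MPI-IO
path or in BMI/MX itself. Something like this (the file names are just
placeholders):

  # create a 32MB file locally, then push it into the PAV volume with pvfs2-cp
  dd if=/dev/zero of=/tmp/localfile bs=1M count=32
  pvfs2-cp /tmp/localfile /tmp/bradles-pav/mount/cp-write-test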
Thanks
Brad
On Thu, Mar 5, 2009 at 9:41 AM, Scott Atchley <[email protected]> wrote:
> On Mar 5, 2009, at 8:46 AM, Robert Latham wrote:
>
>> On Wed, Mar 04, 2009 at 07:15:24PM -0500, Bradley Settlemyer wrote:
>>>
>>> Hello
>>>
>>> I am trying to use PAV to run pvfs with the MX protocol. I've
>>> updated pav so that the servers start and ping correctly. But when I
>>> try to run an MPI code, I'm getting client timeouts, as if the client
>>> cannot contact the servers:
>>>
>>> Lots of this stuff:
>>>
>>> [E 19:11:02.573509] job_time_mgr_expire: job time out: cancelling bmi
>>> operation, job_id: 3.
>>> [E 19:11:02.583659] msgpair failed, will retry: Operation cancelled
>>> (possibly due to timeout)
>
> Brad, which version of MX and PVFS2?
>
>> OK, so the pvfs utilities are all hunky-dory? Not just pvfs2-ping, but
>> pvfs2-cp and pvfs2-ls too?
>>
>> On Jazz, I usually configure MPICH2 to communicate over TCP and have
>> the PVFS system interface communicate over MX. This keeps the
>> situation fairly simple, but of course you get awful MPI performance.
>>
>> Does MX still have the "ports" restriction that GM has? I wonder if
>> MPI communication is getting in the way of PVFS communication...
>>
>> In short, I don't exactly know what's wrong myself. Just tossing out
>> some theories.
>>
>> ==rob
>
> Rob, MX is limited to 8 endpoints per NIC. One can use mx_info to get the
> number:
>
> 8 endpoints per NIC, 1024 NICs on the network, 32 NICs per host
>
> This can be increased to 16 with a module parameter.
>
> Generally, you want no more than one endpoint per process and one process
> per core for MPI. When you want to use MPI-IO over PVFS2, each process will
> need two endpoints (one for MPI and one for PVFS2), so if you have eight
> cores you should increase the max endpoints to 16.
>
> Generally, I would not want to limit my MPI to TCP and my IO to MX,
> especially if my TCP is over gigabit Ethernet. If you run both over MX,
> then unless your IO can exceed the link rate there will be plenty of
> bandwidth left over for MPI, and your latency will stay much lower than
> with TCP.
>
> What is PAV?
>
> Scott
>
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users