Sam and Nathan,

Thanks for your quick replies and suggestions. I tried them both, but it
turned out to be a BlueGene/L mount problem, not a PVFS2 problem at all.

On Friday, we turned on additional debugging messages, and the following
messages appeared in the PVFS2 logs on the metadata server:

[A 04/20 17:36] [EMAIL PROTECTED] H=1048576 S=0x2a96943530: 
   lookup_path: path: mattheww, handle: 1047133
[A 04/20 17:36] [EMAIL PROTECTED] H=1048576 S=0x2a96943530: 
   lookup_path: finish (Success)
[A 04/20 17:36] [EMAIL PROTECTED] H=1047133 S=0x2a967e0470: 
   lookup_path: path: _file_258_co, lookup failed
[A 04/20 17:36] [EMAIL PROTECTED] H=1047133 S=0x2a967e0470: 
   lookup_path: finish (No such file or directory)

This had us convinced that the problem was on the metadata server side, and
that all of the clients were sending requests properly. To investigate
further, we ran a very simple program that just calls a barrier, has each
process fopen() one file with a filename based on its rank, and sums the
number of successful opens. Sure enough, in runs with more than 256 tasks,
some of the fopen() calls didn't return a valid file pointer.

With that, it didn't seem to be an MPI-IO problem. Looking at the client
logs, we found that when booting over eight 32-node partitions, some of the
partitions weren't properly mounting pvfs2. A few changes to the boot
process to remount the file system when required fixed the problem. Since
then, everything has worked fine. I suppose the moral is: "Make sure that
your clients are actually mounting the file system!" Those "lookup failed"
messages were quite perplexing and definitely led me to look in the wrong
place first.
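
For anyone hitting the same thing, a simple client-side sanity check along
these lines can catch a missing mount before jobs run. This is only a
sketch: it assumes the I/O nodes expose /proc/mounts as a standard Linux
kernel does, and the mount point is a placeholder.

    /* Sketch of a client-side sanity check: scan /proc/mounts for a
     * pvfs2 entry at the expected mount point. The mount point below
     * is a placeholder; adjust for the local installation. */
    #include <stdio.h>
    #include <string.h>

    static int pvfs2_mounted(const char *mountpoint)
    {
        char dev[256], dir[256], type[64];
        int found = 0;
        FILE *fp = fopen("/proc/mounts", "r");

        if (!fp)
            return 0;
        while (fscanf(fp, "%255s %255s %63s %*[^\n]",
                      dev, dir, type) == 3) {
            if (strcmp(type, "pvfs2") == 0 &&
                strcmp(dir, mountpoint) == 0) {
                found = 1;
                break;
            }
        }
        fclose(fp);
        return found;
    }

    int main(void)
    {
        if (!pvfs2_mounted("/pvfs2")) {
            fprintf(stderr, "pvfs2 not mounted at /pvfs2; remount\n");
            return 1;
        }
        return 0;
    }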

My apologies for bothering everyone with this, and again, thanks for your
quick offers of assistance. I really appreciate it!

Matthew

-----Original Message-----
From: Sam Lang [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 20, 2007 4:53 PM
To: Matthew Woitaszek
Cc: [email protected]
Subject: Re: [Pvfs2-users] PVFS2 on BlueGene



Hi Matthew,

Does mpi-io-test consistently fail with 257 nodes (9 IO nodes), or do you
get any successful runs there? Are there any messages in the pvfs server
logs (/tmp/pvfs2-server.log)?

Thanks,

-sam


On Apr 20, 2007, at 4:25 PM, Matthew Woitaszek wrote:

>
> Good afternoon,
>
> Michael Oberg and I are attempting to get PVFS2 working on NCAR's 1-rack
> BlueGene/L system using ZeptoOS. We ran into a snag at over 8 BG/L I/O
> nodes (>256 compute nodes).
>
> We've been using the mpi-io-test program shipped with PVFS2 to test the
> system. For cases up to and including 8 I/O nodes (256 coprocessor or 512
> virtual node mode tasks), everything works fine. Larger jobs fail with
> file not found error messages, such as:
>
>    MPI_File_open: File does not exist, error stack:
>    ADIOI_BGL_OPEN(54): File /pvfs2/mattheww/_file_0512_co does not exist
>
> The file is created on the PVFS2 filesystem and has a zero-byte size.
> We've run the tests with 512 tasks on 256 nodes, and it successfully
> created an 8589934592-byte file. Going to 257 nodes fails.
>
> Has anyone seen this behavior before? Are there any PVFS2 server or
> client configuration options that you would recommend for a BG/L
> installation like this?
>
> Thanks for your time,
>
> Matthew
>
>
>



_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
