I have been testing whether PVFS2 can support large-scale,
read-intensive parallel workloads, in particular post-simulation data
analysis. Although the preliminary results (on a small cluster) are
encouraging when everything works, there have been a few occasions
where mysterious "Permission denied" errors occurred and the
applications halted.

Below is the system hardware/software setup:

- 6 compute nodes each with 8 cores, 16 GB memory, 170 GB free disk
space managed by xfs. 
- Each node is connected by a 1 GigE link to a 10 GigE switch
- Linux kernel: 2.6.22.15-7smp

PVFS setup:

- pvfs-2.7.0 installed
- All 6 nodes are used as both metadata servers and IO servers
- The same 6 nodes used to run application codes (as pvfs clients)
- pvfs kernel module installed on all the nodes
- pvfs mounted with the local hostname specified as the metadata server
on each node
- regular unix open/read/close calls from within the applications
- Default file striping on all the servers

Application characteristics:

- Parallel Python programs
- A large number of parallel read threads 
- Mostly independent read traces; occasionally shared accesses to the
same file but by no more than 2 threads
- Large, equally-sized files (> 64 MB)
- Each thread opens a file, reads in the content of the entire file
(most of the time), extracts data of interest, closes the file and moves
to the next file
- The sequence of files to be accessed by each thread pre-determined
(i.e., no runtime arbitration)
- Experiments run on configurations with different numbers of nodes and
different numbers of cores per node; total number of (read) threads
determined by (number of nodes X cores per node)
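
For reference, the per-thread access pattern described above can be
sketched roughly as follows. This is a minimal illustration, not the
actual application code: the function names (assign_files,
read_assigned_files), the round-robin assignment, and the extract
callback are all hypothetical stand-ins. It assumes each thread gets a
fixed list of files up front and reads each file in full.

```python
import errno

def assign_files(all_paths, n_nodes, cores_per_node):
    """Split the file list into one fixed sequence per read thread.

    Total threads = nodes x cores per node. Round-robin assignment is an
    assumption made for this sketch; any pre-determined split would do.
    """
    n_threads = n_nodes * cores_per_node
    return [all_paths[i::n_threads] for i in range(n_threads)]

def read_assigned_files(paths, extract, errors):
    """Open, fully read, and close each file in order, then extract data.

    Failures are recorded as (path, errno number, errno name) in `errors`
    instead of halting, so that EACCES (13) and EAGAIN (11) can be counted.
    """
    results = []
    for path in paths:
        try:
            with open(path, "rb") as f:
                data = f.read()          # read the entire file
        except OSError as e:
            name = errno.errorcode.get(e.errno, "UNKNOWN")
            errors.append((path, e.errno, name))
            continue
        results.append(extract(data))    # keep only the data of interest
    return results
```

Recording the errno per file rather than aborting on the first failure
makes it easier to see whether the failures cluster on particular files,
nodes, or thread counts.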

Error:
- An example (6 nodes, 4 threads per node):
Cannot open file
/scratch/mnt/pvfs2/merged_frameset_64MB/p2auto/00000001/trj/frame000000844
[Errno 13] Permission denied:
'/scratch/mnt/pvfs2/merged_frameset_64MB/p2auto/00000001/trj/frame000000844'
- Similar errors encountered in other node/thread configurations
- The files being reported as inaccessible were all verified to be
accessible from all the 6 compute/storage nodes 
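
A check of this kind can be sketched as below. This is only an
illustration of the idea, not the exact procedure used: check_readable
is a hypothetical helper that tries to open each reported file and read
one byte, and it would have to be run separately on each node.

```python
import errno

def check_readable(paths):
    """Try to open each path and read one byte.

    Returns a list of (path, errno name) for every file that could not
    be opened or read; an empty list means all files were accessible.
    """
    failures = []
    for path in paths:
        try:
            with open(path, "rb") as f:
                f.read(1)
        except OSError as e:
            failures.append((path, errno.errorcode.get(e.errno, str(e.errno))))
    return failures
```

An empty result on every node only shows that the files are readable
when accessed one at a time; it does not reproduce the concurrent load
under which the errors appeared.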


Extra information:
- On the first trial with PVFS, a different error "[Errno 11] Resource
temporarily unavailable" occurred multiple times along with "[Errno 13]
Permission denied."
- PVFS configuration was changed to increase the number of retries from
5 to 10 and the delay from 2 to 2.5 sec
- [Errno 11] did not show up again; but [Errno 13] showed up more often

Thanks for the help.
Tiankai



