I have been testing whether PVFS2 can support large-scale, read-intensive parallel workloads, in particular post-simulation data analysis. The preliminary results (on a small cluster) are encouraging when everything works, but on a few occasions mysterious "Permission denied" errors occurred and the applications halted.

Below is the system hardware/software setup:
- 6 compute nodes, each with 8 cores, 16 GB memory, and 170 GB free disk space managed by xfs
- Each node is connected by a 1 GigE link to a 10 GigE switch
- Linux kernel: 2.6.22.15-7smp

PVFS setup:
- pvfs-2.7.0 installed
- All 6 nodes are used as both metadata servers and IO servers
- The same 6 nodes are used to run the application codes (as pvfs clients)
- pvfs kernel module installed on all the nodes
- pvfs mounted with the local hostname specified as the metadata server on each node
- Regular unix open/read/close calls from within the applications
- Default file striping on all the servers

Application characteristics:
- Parallel Python programs
- A large number of parallel read threads
- Mostly independent read traces; occasionally shared accesses to the same file, but by no more than 2 threads
- Large, equally-sized files (> 64 MB)
- Each thread opens a file, reads in the content of the entire file (most of the time), extracts the data of interest, closes the file, and moves on to the next file
- The sequence of files to be accessed by each thread is pre-determined (i.e., no runtime arbitration)
- Experiments are run on configurations with different numbers of nodes and different numbers of cores per node; the total number of (read) threads is determined by (number of nodes x cores per node)

Error:
- An example (6 nodes, 4 threads per node):
    Cannot open file /scratch/mnt/pvfs2/merged_frameset_64MB/p2auto/00000001/trj/frame000000844
    [Errno 13] Permission denied: '/scratch/mnt/pvfs2/merged_frameset_64MB/p2auto/00000001/trj/frame000000844'
- Similar errors were encountered in other node/thread configurations
- The files reported as inaccessible were all verified to be accessible from all 6 compute/storage nodes

Extra information:
- On the first trial with PVFS, a different error, "[Errno 11] Resource temporarily unavailable", occurred multiple times along with "[Errno 13] Permission denied."
- The PVFS configuration was changed to increase the number of retries from 5 to 10 and the retry delay from 2 to 2.5 sec
- After that change, [Errno 11] did not show up again, but [Errno 13] showed up more often

Thanks for the help.

Tiankai
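
For anyone trying to reproduce this, the per-thread access pattern described under "Application characteristics" can be sketched roughly as below. This is a minimal illustration and not the actual application code. The names read_whole_file and process_files, the attempt count, and the delay are all hypothetical. Retrying on errno 13 and errno 11 is only an assumption about how one might tell a transient failure from a persistent one.

```python
import errno
import sys
import time

# The two errnos reported above: 13 (EACCES) and 11 (EAGAIN on Linux).
# Treating them as retryable is an assumption made for diagnosis only.
RETRYABLE = {errno.EACCES, errno.EAGAIN}


def read_whole_file(path, attempts=3, delay=0.5):
    """Open a file, read it in full, close it.

    Retries only on the errnos in RETRYABLE. Returns (data, failures),
    where failures lists (attempt_number, errno_name) for each failed try.
    Any other error, or a retryable error on the last attempt, is re-raised.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "rb") as f:
                return f.read(), failures
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            failures.append((attempt, name))
            if exc.errno not in RETRYABLE or attempt == attempts:
                raise
            time.sleep(delay)


def process_files(paths, extract=len):
    """Walk a pre-determined list of files, as one read thread would.

    `extract` stands in for the application's data extraction step.
    """
    results = []
    for path in paths:
        data, failures = read_whole_file(path)
        if failures:
            # A read that succeeded only after retries points to a transient error.
            print("%s: succeeded after %r" % (path, failures), file=sys.stderr)
        results.append(extract(data))
    return results
```

If a file that fails with errno 13 on the first attempt opens cleanly a moment later, that would suggest the error is transient rather than a real permission problem on the file.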
