Hello Murali,

In the client logs I'm seeing some errors related to this, as follows:

...

[E 08:05:59.628334] PINT_cached_config_get_server_name failed: Invalid argument
[E 08:05:59.628411] Failed to map server address to handle
[E 08:05:59.628422] src/client/sysint/sys-getattr.sm line 708: Error: failed to resolve meta server addresses.
[E 08:05:59.628544] [bt] pvfs2-client-core [0x416f28]
[E 08:05:59.628555] [bt] pvfs2-client-core [0x4159a8]
[E 08:05:59.628564] [bt] pvfs2-client-core(PINT_client_state_machine_testsome+0x1a0) [0x415e90]
[E 08:05:59.628574] [bt] pvfs2-client-core [0x411877]
[E 08:05:59.628583] [bt] pvfs2-client-core(main+0x465) [0x4133c5]
[E 08:05:59.628592] [bt] /lib64/tls/libc.so.6(__libc_start_main+0xea) [0x2a95cd4aaa]
[E 08:05:59.628601] [bt] pvfs2-client-core(__strtoll_internal+0x42) [0x40cc5a]

...

Does anything click? The server logs are quiet, though.

All machines have the following configuration.
berkeley db : 4.2
kernel : 2.6.5-7.244-smp
arch : x86_64
distro : SLES 9
PVFS is running over IB; I don't think it would be possible to check with TCP here. We have used both TCP and IB with pvfs2-1.5 on a smaller cluster, but this is a different cluster and a new PVFS installation, so there are a lot of new variables.

I think I'm going to re-install the whole thing with pvfs-2.6.1, just to make sure, and try again. I had avoided this since it's an expensive step for us time-wise, having to get the IT guys involved, etc.

Thanks
Vikrant

Hi Vikrant,
Do all these machines bind to the same NIS domain/group/server?
Is it possible that your uids/gids etc. don't match up on all the
different machines?
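(If it helps to compare, something as simple as the sketch below, run on each
node, would show which uid/gid and names each machine resolves. This is just
generic libc code on my part, nothing PVFS2-specific, and not something I've
run on your setup:)

/* id_check.c - print the uid/gid and the names this node resolves them
 * to (via NSS/NIS), so the output can be diffed across machines. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <pwd.h>
#include <grp.h>

int main(void)
{
    uid_t uid = getuid();
    gid_t gid = getgid();
    struct passwd *pw = getpwuid(uid);
    struct group  *gr = getgrgid(gid);

    printf("uid=%d (%s) gid=%d (%s)\n",
           (int)uid, pw ? pw->pw_name : "?",
           (int)gid, gr ? gr->gr_name : "?");
    return 0;
}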
We have had a problem where the servers would rely on NIS to set things
up correctly, which Sam has fixed in HEAD. I am not sure if that is
what you are seeing here... could be wrong, though.
What distro, Berkeley DB version, and kernel version are you running? Do you
see anything in the client kernel logs or the server logs
(pvfs2-server.log)? Are all the machines 32-bit, 64-bit, or a
mixture?
There is something really wrong with your setup... something as simple as
this should work.
BTW: are you using IB, or can this problem be repro'ed with TCP as well?
Thanks for the reports!
Murali

Some more info about the issue mentioned below:
I can now reproduce this problem consistently by just creating a file on
specific machines, and it seems to depend on whether that particular
machine has just the client running or both server and client running.
In my PVFS configuration it is as follows:
running only client : deva02 and deva03
running both client and server : deva{04-11}

So if I create a file on any machine in the second group (running both
client and server), it is not accessible from the first group (trying to
ls that file gives an "Invalid argument" error).
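
A rough sketch of what I mean, stripped down to plain syscalls (the path
below is just a placeholder, not my real directory): create the file on a
node from the second group, then stat it from a node in the first group and
look at the errno.

/* pvfs2_check.c - two-step check: "create" on one node, "stat" on another.
 * An EINVAL from stat() would match the "Invalid argument" that ls shows. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s create|stat <path>\n", argv[0]);
        return 2;
    }

    if (strcmp(argv[1], "create") == 0) {
        /* run on a client+server node, e.g. deva04 */
        int fd = open(argv[2], O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }
        close(fd);
        printf("created %s\n", argv[2]);
    } else {
        /* run on a client-only node, e.g. deva02 */
        struct stat sb;
        if (stat(argv[2], &sb) != 0) {
            fprintf(stderr, "stat %s: %s (errno=%d)\n",
                    argv[2], strerror(errno), errno);
            return 1;
        }
        printf("stat ok: size=%lld\n", (long long)sb.st_size);
    }
    return 0;
}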

Guys, any clue what's happening?

Thanks
Vikrant

Hi,

This is the layout of the file system:
11 clients on deva{02-11}
8 servers on deva{04-11}; each node has four cores.

I created this file on deva02 and tried to look for it on deva04; this
is the error I get:

[EMAIL PROTECTED]:/mnt/pvfs2/vsk/fl5l2$ ls test.jou
ls: test.jou: Invalid argument

On deva02 it lists the file correctly. I have waited much longer than
30 seconds for this (many minutes, and now days). This does not always
happen; usually things work fine. I'm not sure what particular sequence
of steps leads to this situation; I had to dig into the console history
to get this output.

This is with proprietary code, but I will try to send you some
sample MPI code that shows a similar problem soon. We have used this
code successfully with a previous installation of pvfs2-1.5, so it looks
like an installation issue or a bug in the current release.
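
In the meantime, the pattern is roughly the one sketched below. This is only
an illustration I'm putting together, not the proprietary code, and the path
under /mnt/pvfs2 is a placeholder.

/* mpi_open_check.c - minimal sketch of the failing pattern: every rank
 * collectively opens (creates) a file on the PVFS2 mount and reports
 * the error string if the open fails. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, rc, msglen;
    char msg[MPI_MAX_ERROR_STRING];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    rc = MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs2/vsk/mpi_test.dat",
                       MPI_MODE_CREATE | MPI_MODE_RDWR,
                       MPI_INFO_NULL, &fh);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &msglen);
        fprintf(stderr, "rank %d: MPI_File_open failed: %s\n", rank, msg);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_File_close(&fh);
    if (rank == 0)
        printf("MPI_File_open/close succeeded on all ranks\n");
    MPI_Finalize();
    return 0;
}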
Would the config files and configure options for this installation
help you identify whether it's an installation issue?

Thanks
Vikrant

Sam Lang wrote:
Hi Vikrant,

Along with the MPI code, if you could send us the output of your shell
commands and the errors you see, that would also be helpful in debugging.

Thanks,

-sam

On Dec 20, 2006, at 11:44 AM, Robert Latham wrote:

On Wed, Dec 20, 2006 at 06:55:22PM +0530, Vikrant Kumar wrote:
With MPI applications it fails at certain times in MPI_File_open on
some
nodes, which again looks similar to the above problem.

Can you guys suggest how I can isolate the problem?
Let me know what information you require.
Oh, one more thing that would help is if you can send us the MPI code
you are using.  If we can reproduce the problem on our end, that will
make debugging and fixing a lot easier.

==rob

--Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
