Re: [Pvfs2-developers] Problem upgrading from 2.6 to 2.8

Phil Carns Fri, 16 Apr 2010 12:40:19 -0700

Sadly, none of my test boxes will run 2.6 any more, but I have a theory about what the problem might be here.

For some background, the pvfs2-server daemon does these steps in order (among others): initializes BMI (networking), initializes Trove (storage), and then finally starts processing requests.


In your case, two extra things are going on:

- the trove initialization may take a while, because it has to convert the format of all objects from v. 2.6 to 2.8, especially if it is also switching to o_direct format at the same time.

- whichever server gets done first is going to immediately contact the other servers in order to precreate handles for new files (a new feature in 2.8)

I'm guessing that one server finished the trove conversion before the others and started its precreate requests. The other servers can't answer yet (because they are still busy with trove), but since BMI is already running, the incoming precreate requests just get queued up on the socket. When the slow server finally does try to service them, the requests are way out of date and have since been retried by the fast server.
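
To make the guess above concrete, here is a toy model of the queue build-up. It is only an illustration of the theory, not anything taken from the PVFS2 source: the function name, the parameters, and the assumption that the fast server retries at a fixed interval are all hypothetical.

```c
#include <stdio.h>

/* Toy model (hypothetical): count how many queued precreate requests have
 * already been superseded by a retry when the slow server finally starts
 * servicing its socket.
 *
 * conversion_secs     - how long the slow server spends in trove init
 * retry_interval_secs - assumed fixed gap between attempts by the fast server
 * max_attempts        - assumed cap on attempts (first try plus retries)
 */
static int count_stale_requests(int conversion_secs, int retry_interval_secs,
                                int max_attempts)
{
    int queued = 0;

    if (retry_interval_secs <= 0)
        return 0; /* model is undefined without a positive interval */

    /* One request lands on the socket at t = 0 and at every retry interval
     * until the conversion ends or the fast server gives up. */
    for (int t = 0; t < conversion_secs && queued < max_attempts;
         t += retry_interval_secs) {
        queued++;
    }

    /* Every queued request except the most recent one has since been
     * retried, so it is out of date by the time it is serviced. */
    return queued > 0 ? queued - 1 : 0;
}
```

For example, count_stale_requests(100, 30, 5) models a 100 second conversion with attempts at t = 0, 30, 60 and 90, so three of the four queued requests are stale when the slow server wakes up.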

I'm not sure exactly what goes wrong from there, but if that's the cause, the solution might be relatively simple. If you look in pvfs2-server.c, you can take the block of code from "BMI_initialize(...)" to "*server_status_flag |= SERVER_BMI_INIT;" and try moving that whole block to _after_ the "*server_status_flag |= SERVER_TROVE_INIT;" line that indicates that trove is done.
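
Roughly, the reordering would look like the sketch below. This is not the real pvfs2-server.c code: the two flag names come from the description above, but their values and the stub_* functions are hypothetical stand-ins so that the example compiles on its own.

```c
#include <assert.h>
#include <stdio.h>

/* Flag names as mentioned above; the bit values are made up for this sketch. */
enum {
    SERVER_BMI_INIT   = 1 << 0,
    SERVER_TROVE_INIT = 1 << 1
};

/* Hypothetical stub for the networking block that starts at
 * BMI_initialize(...) in the real server. */
static int stub_bmi_init(void)
{
    puts("BMI up: other servers can now reach us");
    return 0;
}

/* Hypothetical stub for the storage setup, which may include the slow
 * 2.6 -> 2.8 format conversion. */
static int stub_trove_init(void)
{
    puts("Trove up: format conversion finished");
    return 0;
}

/* Proposed order: storage first, networking second. */
static int server_init_reordered(int *server_status_flag)
{
    if (stub_trove_init() != 0)
        return -1;
    *server_status_flag |= SERVER_TROVE_INIT;

    /* Nothing can queue up on the socket before this point, because the
     * network layer has not been started yet. */
    assert(*server_status_flag & SERVER_TROVE_INIT);

    if (stub_bmi_init() != 0)
        return -1;
    *server_status_flag |= SERVER_BMI_INIT;

    return 0;
}
```

Calling server_init_reordered(&flags) with flags set to 0 leaves both bits set on success. The point of the ordering is that a peer's precreate request gets a connection failure (and a later retry) instead of sitting in a queue while the conversion runs.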


-Phil

On 03/30/2010 06:23 PM, Bart Taylor wrote:

I am having some problems upgrading existing file systems to 2.8. After I finish the upgrade and start the file system, I cannot create files. Simple commands like dd and cp stall until they time out and leave partial dirents like this:
[bat...@client t]$ dd if=/dev/zero of=/mnt/pvfs28/10MB.dat.6 bs=1M count=10
dd: writing `/mnt/pvfs28/10MB.dat.6': Connection timed out
1+0 records in
0+0 records out
0 bytes (0 B) copied, 180.839 seconds, 0.0 kB/s


[r...@client ~]# ls -alh /mnt/pvfs28/
total 31M
drwxrwxrwt 1 root   root   4.0K Mar 30 11:24 .
drwxr-xr-x 4 root   root   4.0K Mar 23 13:38 ..
-rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.1
-rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.2
-rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.3
?--------- ? ?      ?         ?            ? 10MB.dat.5
drwxrwxrwx 1 root   root   4.0K Mar 29 14:06 lost+found
This happens both on local disk and on network storage, but it only happens if the upgraded file system starts up the first time using directio. If it is started with alt-aio as the TroveMethod, everything works as expected. It also only happens the first time the file system is started; if I stop the server daemons and restart them, everything operates as expected. I do have to kill -9 the server daemons, since they will not exit gracefully.
My test is running on RHEL4 U8 i386 with kernel version 2.6.9-89.ELsmp with two server nodes and one client. I was unable to recreate the problem with a single server.
I attached verbose server logs from the time the daemon was started after the upgrade until the client failed, as well as client logs from mount until the returned error. The cliffs notes are that one of the servers logs as many unstuff requests as we have client retries configured. The client fails at the end of the allotted retries. The other server doesn't log anything after starting.
Has anyone seen anything similar or know what might be going on?

Bart.




_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
