Bart,

  I think the server should print out when conversion starts and ends.

  examples:
  Trove Migration Started: Ver=2.6.3
  Trove Migration Complete: Ver=2.6.3
  Trove Migration Set: 2.8.1

  Does it get that far?

kevin

On Apr 29, 2010, at 1:55 PM, Bart Taylor wrote:

> Thanks for the information and suggestion, Phil. Unfortunately, I didn't get a 
> different result after moving that BMI init block. I also managed to 
> reproduce this once while leaving the trove method set to alt-aio, although 
> that doesn't seem directly related to the direction you were going.
> 
> Another thing I noticed is that I can create files successfully after the 
> upgrade as long as the size is within 64k, which is the value of my strip_size 
> distribution param. Once the size exceeds that value, I start running into 
> this problem again.
> 
> Does that help shed any more light on my situation? 
> 
> Bart.
> 
> 
> On Fri, Apr 16, 2010 at 1:39 PM, Phil Carns <[email protected]> wrote:
> Sadly none of my test boxes will run 2.6 any more, but I have a theory about 
> what the problem might be here.
> 
> For some background, the pvfs2-server daemon does these steps in order (among 
> others): initializes BMI (networking), initializes Trove (storage), and then 
> finally starts processing requests.
> 
> In your case, two extra things are going on:
> 
> - the trove initialization may take a while, because it has to convert the 
> format of all objects from v. 2.6 to v. 2.8, especially if it is also 
> switching to o_direct format at the same time.
> 
> - whichever server gets done first is going to immediately contact the other 
> servers in order to precreate handles for new files (a new feature in 2.8).
> 
> I'm guessing that one server finished the trove conversion before the others 
> and started its pre-create requests.  The other servers can't answer yet 
> (because they are still busy with trove), but since BMI is already running 
> the incoming precreate requests just get queued up on the socket.  When the 
> slow server finally does try to service them, the requests are way out of 
> date and have since been retried by the fast server.
> 
> I'm not sure exactly what goes wrong from there, but if that's the cause, the 
> solution might be relatively simple.  If you look in pvfs2-server.c, you can 
> take the block of code from "BMI_initialize(...)" to "*server_status_flag |= 
> SERVER_BMI_INIT;" and try moving that whole block to _after_ the 
> "*server_status_flag |= SERVER_TROVE_INIT;" line that indicates that trove is 
> done.
> 
> -Phil
> 
> 
> On 03/30/2010 06:23 PM, Bart Taylor wrote:
>> 
>> I am having some problems upgrading existing file systems to 2.8. After I 
>> finish the upgrade and start the file system, I cannot create files. Simple 
>> commands like dd and cp stall until they timeout and leave partial dirents 
>> like this:
>> 
>> [bat...@client t]$ dd if=/dev/zero of=/mnt/pvfs28/10MB.dat.6 bs=1M count=10
>> dd: writing `/mnt/pvfs28/10MB.dat.6': Connection timed out
>> 1+0 records in
>> 0+0 records out
>> 0 bytes (0 B) copied, 180.839 seconds, 0.0 kB/s
>> 
>> 
>> [r...@client ~]# ls -alh /mnt/pvfs28/
>> total 31M
>> drwxrwxrwt 1 root   root   4.0K Mar 30 11:24 .
>> drwxr-xr-x 4 root   root   4.0K Mar 23 13:38 ..
>> -rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.1
>> -rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.2
>> -rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.3
>> ?--------- ? ?      ?         ?            ? 10MB.dat.5
>> drwxrwxrwx 1 root   root   4.0K Mar 29 14:06 lost+found
>> 
>> 
>> This happens both on local disk and on network storage, but it only happens 
>> if the upgraded file system starts up the first time using directio. If it 
>> is started with alt-aio as the TroveMethod, everything works as expected. It 
>> also only happens the first time the file system is started; if I stop the 
>> server daemons and restart them, everything operates as expected. I do have 
>> to kill -9 the server daemons, since they will not exit gracefully.
>> 
>> My test is running on RHEL4 U8 i386 with kernel version 2.6.9-89.ELsmp with 
>> two server nodes and one client. I was unable to recreate the problem with a 
>> single server. 
>> 
>> I attached verbose server logs from the time the daemon was started after 
>> the upgrade until the client failed, as well as client logs from mount until 
>> the error was returned. The CliffsNotes version is that one of the servers 
>> logs as many unstuff requests as we have client retries configured, and the 
>> client fails at the end of the allotted retries. The other server doesn't 
>> log anything after starting. 
>> 
>> Has anyone seen anything similar or know what might be going on? 
>> 
>> Bart.
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Pvfs2-developers mailing list
>> 
>> [email protected]
>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>> 
>>   
>> 
> 
> 


