Yes, it does finish the Trove Migration and print similar messages. The file
system responds to requests; I just can't create files larger than one strip
size.  Once I restart the file system I can, but on the first start those
creates fail.

Bart.



On Thu, Apr 29, 2010 at 1:50 PM, Kevin Harms <[email protected]> wrote:

> Bart,
>
>  I think the server should print out when conversion starts and ends.
>
>  examples:
>  Trove Migration Started: Ver=2.6.3
>  Trove Migration Complete: Ver=2.6.3
>  Trove Migration Set: 2.8.1
>
>  Does it get that far?
>
> kevin
>
> On Apr 29, 2010, at 1:55 PM, Bart Taylor wrote:
>
> > Thanks for the information and the suggestion, Phil. Unfortunately, I didn't
> get a different result after moving that BMI init block. I also managed to
> reproduce this once while leaving the trove method set to alt-aio, although
> that doesn't seem directly related to the direction you were going.
> >
> > Another thing I noticed is that I can create files successfully after the
> upgrade as long as the size is within 64k, which is the value of my
> strip_size distribution parameter. Once the size exceeds that value, I start
> running into this problem again.
> >
> > Does that help shed any more light on my situation?
> >
> > Bart.
> >
> >
> > On Fri, Apr 16, 2010 at 1:39 PM, Phil Carns <[email protected]> wrote:
> > Sadly none of my test boxes will run 2.6 any more, but I have a theory
> about what the problem might be here.
> >
> > For some background, the pvfs2-server daemon does these steps in order
> (among others): initializes BMI (networking), initializes Trove (storage),
> and then finally starts processing requests.
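> >
> > In a heavily simplified sketch (BMI_initialize() and the server_status_flag
> lines I mention below are the real names; the rest here is just placeholder,
> not the literal code), the startup currently looks like:
> >
> >     /* 1. bring up networking first */
> >     ret = BMI_initialize(...);
> >     /* ... rest of the BMI init block ... */
> >     *server_status_flag |= SERVER_BMI_INIT;
> >
> >     /* 2. bring up storage; the 2.6 -> 2.8 migration runs in here */
> >     /* ... trove init block ... */
> >     *server_status_flag |= SERVER_TROVE_INIT;
> >
> >     /* 3. start processing requests */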
> >
> > In your case, two extra things are going on:
> >
> > - the trove initialization may take a while, because it has to convert the
> format of all objects from v. 2.6 to 2.8, especially if it is also switching
> to the o_direct format at the same time.
> >
> > - whichever server finishes first is going to immediately contact the
> other servers in order to precreate handles for new files (a new feature
> in 2.8).
> >
> > I'm guessing that one server finished the trove conversion before the
> others and started its precreate requests.  The other servers can't answer
> yet (because they are still busy with trove), but since BMI is already
> running, the incoming precreate requests just get queued up on the socket.
> When the slow server finally does try to service them, the requests are way
> out of date and have since been retried by the fast server.
> >
> > I'm not sure exactly what goes wrong from there, but if that's the cause,
> the solution might be relatively simple.  If you look in pvfs2-server.c, you
> can take the block of code from "BMI_initialize(...)" to
> "*server_status_flag |= SERVER_BMI_INIT;" and try moving that whole block to
> _after_ the "*server_status_flag |= SERVER_TROVE_INIT;" line that indicates
> that trove is done.
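> >
> > In the same sketch form (again, not the literal code), the reordered startup
> would be:
> >
> >     /* storage first, so the conversion finishes before anyone can talk to us */
> >     /* ... trove init block ... */
> >     *server_status_flag |= SERVER_TROVE_INIT;
> >
> >     /* only now bring up networking */
> >     ret = BMI_initialize(...);
> >     /* ... rest of the BMI init block ... */
> >     *server_status_flag |= SERVER_BMI_INIT;
> >
> >     /* then start processing requests as before */
> >
> > That way no precreate request can even reach a server until its trove
> conversion is done.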
> >
> > -Phil
> >
> >
> > On 03/30/2010 06:23 PM, Bart Taylor wrote:
> >>
> >> I am having some problems upgrading existing file systems to 2.8. After
> I finish the upgrade and start the file system, I cannot create files.
> Simple commands like dd and cp stall until they time out and leave partial
> dirents like this:
> >>
> >> [bat...@client t]$ dd if=/dev/zero of=/mnt/pvfs28/10MB.dat.6 bs=1M
> count=10
> >> dd: writing `/mnt/pvfs28/10MB.dat.6': Connection timed out
> >> 1+0 records in
> >> 0+0 records out
> >> 0 bytes (0 B) copied, 180.839 seconds, 0.0 kB/s
> >>
> >>
> >> [r...@client ~]# ls -alh /mnt/pvfs28/
> >> total 31M
> >> drwxrwxrwt 1 root   root   4.0K Mar 30 11:24 .
> >> drwxr-xr-x 4 root   root   4.0K Mar 23 13:38 ..
> >> -rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.1
> >> -rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.2
> >> -rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.3
> >> ?--------- ? ?      ?         ?            ? 10MB.dat.5
> >> drwxrwxrwx 1 root   root   4.0K Mar 29 14:06 lost+found
> >>
> >>
> >> This happens both on local disk and on network storage, but it only
> happens if the upgraded file system starts up the first time using directio.
> If it is started with alt-aio as the TroveMethod, everything works as
> expected. It also only happens the first time the file system is started; if
> I stop the server daemons and restart them, everything operates as expected.
> I do have to kill -9 the server daemons, since they will not exit
> gracefully.
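> >>
> >> For reference, the only thing I change between the two cases is the
> TroveMethod setting in my server config file, something like this (going
> from memory, so the exact section it sits in may differ):
> >>
> >>   <Defaults>
> >>       # (not sure this is the exact section TroveMethod belongs in)
> >>       ...
> >>       TroveMethod directio
> >>   </Defaults>
> >>
> >> versus "TroveMethod alt-aio" for the case that works.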
> >>
> >> My test is running on RHEL4 U8 i386 with kernel version 2.6.9-89.ELsmp
> with two server nodes and one client. I was unable to recreate the problem
> with a single server.
> >>
> >> I attached verbose server logs from the time the daemon was started
> after the upgrade until the client failed, as well as client logs from mount
> until the error was returned. The short version is that one of the servers
> logs as many unstuff requests as we have client retries configured, the
> client fails at the end of the allotted retries, and the other server
> doesn't log anything after starting.
> >>
> >> Has anyone seen anything similar, or does anyone know what might be going
> on?
> >>
> >> Bart.
> >>
> >>
> >>
> >>
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
