Can you get a server into this state (where everything works except for files larger than one strip size), turn on verbose logging, and then try to create a big file?

I'd like to see the log file from the metadata server for the file in question. That server is the one that has to come up with the pre-created file handles at that point and must be having a problem. Even if the pre-create requests had failed up until then, it is supposed to eventually sort things out.
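
(For anyone following along, here is a rough sketch of the precreate idea in illustrative C, not the actual PVFS2 source. The pool sizes, watermark, and function names below are made up; only the general shape -- handles requested from the other servers ahead of time, handed out on create, and refilled in the background with retries -- reflects what is being described here.)

  /* Illustration only -- not the actual PVFS2 code.  Rough shape of the
   * 2.8 precreate pool: handles are requested from peer servers ahead of
   * time, handed out when a create comes in, and topped back up by a
   * background refill that keeps retrying after failures. */
  #include <stddef.h>

  #define POOL_MAX       256   /* made-up sizes, for illustration */
  #define POOL_LOW_WATER  32

  struct handle_pool {
      unsigned long handles[POOL_MAX];   /* precreated on one peer server */
      size_t count;
  };

  /* Create path: take a precreated handle, or report "none yet" so the
   * request can be retried once the pool has been refilled. */
  static int pool_get(struct handle_pool *p, unsigned long *out)
  {
      if (p->count == 0)
          return -1;
      *out = p->handles[--p->count];
      return 0;
  }

  /* Background refill: ask the peer for another batch whenever the pool
   * runs low.  Failed batches are retried on a later pass, which is the
   * "eventually sorts things out" behavior described above. */
  static void pool_refill(struct handle_pool *p,
                          int (*request_batch)(unsigned long *buf, size_t n))
  {
      while (p->count < POOL_LOW_WATER) {
          unsigned long batch[16];
          if (request_batch(batch, 16) != 0)
              break;   /* peer not ready (e.g. still migrating); retry later */
          for (size_t i = 0; i < 16 && p->count < POOL_MAX; i++)
              p->handles[p->count++] = batch[i];
      }
  }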

thanks,
-Phil

On 04/29/2010 04:51 PM, Bart Taylor wrote:
Yes, it does finish the Trove Migration and print similar messages. The file system responds to requests; I just can't create files larger than one strip size. Once I restart the file system I can, but on the first start they fail.

Bart.



On Thu, Apr 29, 2010 at 1:50 PM, Kevin Harms <[email protected]> wrote:

    Bart,

     I think the server should print out when conversion starts and ends.

     examples:
     Trove Migration Started: Ver=2.6.3
     Trove Migration Complete: Ver=2.6.3
     Trove Migration Set: 2.8.1

     Does it get that far?

    kevin

    On Apr 29, 2010, at 1:55 PM, Bart Taylor wrote:

    > Thanks for the information and suggestion, Phil. Unfortunately, I
    didn't get a different result after moving that BMI init block. I
    also managed to reproduce this once while leaving the trove method
    set to alt-aio, although that doesn't seem directly related to the
    direction you were going.
    >
    > Another thing I noticed is that I can create files successfully
    after the upgrade as long as the size is within 64k, which is the
    value of my strip_size distribution parameter. Once the size
    exceeds that value, I start running into this problem again.
    >
    > Does that help shed any more light on my situation?
    >
    > Bart.
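
(Side note on the 64k observation above: with the simple_stripe distribution, byte offsets map to datafiles round-robin in strip_size chunks, so a file that never grows past one strip only ever needs its first datafile. The first byte past 64k is the first one that needs a datafile on the second server, which -- if I understand the stuffed-file/unstuff feature correctly -- is where the precreated handles come into play, and presumably why the unstuff requests show up in the original report below. A tiny illustration, using the 64k / two-server numbers from this thread:)

  #include <stdio.h>

  /* Round-robin striping with strip_size = 64k across two servers, as in
   * the setup described in this thread.  Everything up to 64k lands on
   * datafile 0; offset 65536 is the first byte that needs a datafile on
   * the second server. */
  int main(void)
  {
      const long strip_size  = 64 * 1024;   /* strip_size distribution param */
      const int  dfile_count = 2;           /* two data servers */
      const long offsets[]   = { 0, 65535, 65536, 10L * 1024 * 1024 - 1 };

      for (int i = 0; i < 4; i++) {
          int dfile = (int)((offsets[i] / strip_size) % dfile_count);
          printf("offset %8ld -> datafile %d\n", offsets[i], dfile);
      }
      return 0;
  }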
    >
    >
    > On Fri, Apr 16, 2010 at 1:39 PM, Phil Carns <[email protected]> wrote:
    > Sadly none of my test boxes will run 2.6 any more, but I have a
    theory about what the problem might be here.
    >
    > For some background, the pvfs2-server daemon does these steps in
    order (among others): initializes BMI (networking), initializes
    Trove (storage), and then finally starts processing requests.
    >
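
(A schematic of that ordering, for readers skimming the thread. This is not the real pvfs2-server.c, which does far more; the helper names and flag values are placeholders, and only the ordering plus the SERVER_BMI_INIT / SERVER_TROVE_INIT flag names come from this message.)

  #include <stdio.h>

  #define SERVER_BMI_INIT   (1 << 0)   /* placeholder bit values */
  #define SERVER_TROVE_INIT (1 << 1)

  static int init_bmi(void)   { puts("BMI up: peers can reach us"); return 0; }
  static int init_trove(void) { puts("Trove up: storage ready");    return 0; }
  static int main_loop(void)  { puts("servicing requests");         return 0; }

  int main(void)
  {
      int server_status_flag = 0;

      if (init_bmi() < 0)                   /* networking first, so the   */
          return 1;                         /* listening socket exists    */
      server_status_flag |= SERVER_BMI_INIT;

      if (init_trove() < 0)                 /* then storage -- on a first */
          return 1;                         /* 2.8 start this is where    */
                                            /* the Trove migration runs,  */
                                            /* possibly for a long time   */
      server_status_flag |= SERVER_TROVE_INIT;

      (void)server_status_flag;
      return main_loop();                   /* requests served only here  */
  }
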
    > In your case, two extra things are going on:
    >
    > - the trove initialization may take a while, because it has to
    do a conversion of the format for all objects from v. 2.6 to 2.8,
    especially if it is also switching to o_direct format at the same
    time.
    >
    > - whichever server gets done first is going to immediately
    contact the other servers in order to precreate handles for new
    files (a new feature in 2.8)
    >
    > I'm guessing that one server finished the trove conversion
    before the others and started its pre-create requests.  The other
    servers can't answer yet (because they are still busy with trove),
    but since BMI is already running, the incoming precreate requests
    just get queued up on the socket.  When the slow server finally
    does try to service them, the requests are way out of date and
    have since been retried by the fast server.
    >
    > I'm not sure exactly what goes wrong from there, but if that's
    the cause, the solution might be relatively simple.  If you look
    in pvfs2-server.c, you can take the block of code from
    "BMI_initialize(...)" to "*server_status_flag |= SERVER_BMI_INIT;"
    and try moving that whole block to _after_ the
    "*server_status_flag |= SERVER_TROVE_INIT;" line that indicates
    that trove is done.
    >
    > -Phil
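
(And, for completeness, a schematic of the reordering being suggested -- again not a real patch against pvfs2-server.c, and reusing the placeholder helpers and macros from the sketch earlier in this message. The only change is that the BMI block moves below the SERVER_TROVE_INIT line, so a fast peer has no socket on which to queue stale precreate requests while this server is still migrating.)

  int main(void)
  {
      int server_status_flag = 0;

      if (init_trove() < 0)                 /* storage (and the 2.6 -> 2.8 */
          return 1;                         /* migration) finish first     */
      server_status_flag |= SERVER_TROVE_INIT;

      if (init_bmi() < 0)                   /* only now can peers connect  */
          return 1;                         /* and send precreate requests */
      server_status_flag |= SERVER_BMI_INIT;

      (void)server_status_flag;
      return main_loop();
  }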
    >
    >
    > On 03/30/2010 06:23 PM, Bart Taylor wrote:
    >>
    >> I am having some problems upgrading existing file systems to
    2.8. After I finish the upgrade and start the file system, I
    cannot create files. Simple commands like dd and cp stall until
    they time out and leave partial dirents like this:
    >>
    >> [bat...@client t]$ dd if=/dev/zero of=/mnt/pvfs28/10MB.dat.6
    bs=1M count=10
    >> dd: writing `/mnt/pvfs28/10MB.dat.6': Connection timed out
    >> 1+0 records in
    >> 0+0 records out
    >> 0 bytes (0 B) copied, 180.839 seconds, 0.0 kB/s
    >>
    >>
    >> [r...@client ~]# ls -alh /mnt/pvfs28/
    >> total 31M
    >> drwxrwxrwt 1 root   root   4.0K Mar 30 11:24 .
    >> drwxr-xr-x 4 root   root   4.0K Mar 23 13:38 ..
    >> -rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.1
    >> -rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.2
    >> -rw-rw-r-- 1 batayl batayl  10M Mar 30 08:44 10MB.dat.3
    >> ?--------- ? ?      ?         ?            ? 10MB.dat.5
    >> drwxrwxrwx 1 root   root   4.0K Mar 29 14:06 lost+found
    >>
    >>
    >> This happens both on local disk and on network storage, but it
    only happens if the upgraded file system starts up the first time
    using directio. If it is started with alt-aio as the TroveMethod,
    everything works as expected. It also only happens the first time
    the file system is started; if I stop the server daemons and
    restart them, everything operates as expected. I do have to kill
    -9 the server daemons, since they will not exit gracefully.
    >>
    >> My test is running on RHEL4 U8 i386 with kernel version
    2.6.9-89.ELsmp with two server nodes and one client. I was unable
    to recreate the problem with a single server.
    >>
    >> I attached verbose server logs from the time the daemon was
    started after the upgrade until the client failed, as well as
    client logs from mount until the returned error. The short version
    is that one of the servers logs as many unstuff requests as we
    have client retries configured, and the client fails at the end of
    the allotted retries. The other server doesn't log anything after
    starting.
    >>
    >> Has anyone seen anything similar or know what might be going on?
    >>
    >> Bart.
    >>
    >>
    >>
    >>



_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
