While we're waiting to hear your full findings/report, we're going to attempt to debug PanFS to see whether there is a problem with our system or whether this is a lost cause. If you have any suggestions for tests we can run against PanFS, they would be appreciated. We'll also be looking into alternative hardware that can be used to present real LUNs to the servers. I don't know if we have budget for that, but at least we'll know what it costs to fix the problem for good.

Thanks,
Mike Robbert

On 6/3/13 12:58 PM, Becky Ligon wrote:

Mike:

When I run the same "touch" test using local storage as the metadata and
data stores, I get great response, paralleling what I get on our own
cluster.  So, the kernel version doesn't seem to make a difference where
the "access" system call is concerned.

I ran some tests last night where I removed the system call to "access",
which removes the calls to PanFS, and I got great response.  The problem,
therefore, appears to be running the system call "access" against PanFS.
The Berkeley DB has nothing to do with your issue.
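
A minimal standalone sketch of one such test, for comparison: it times repeated
access() calls against a PanFS-backed path versus a local one. The paths, the
default file names, and the program name (access_timer.c) are placeholders, not
anything from this thread; point the first argument at a directory on a PanFS
mount on one of the servers.

    /* access_timer.c -- compare access() latency on PanFS vs. local disk.
     * Build: gcc -O2 -o access_timer access_timer.c -lrt
     * Run:   ./access_timer /panfs/somevol/testfile /tmp/testfile
     */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static double time_access_ms(const char *path, int iterations)
    {
        struct timespec start, end;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < iterations; i++)
            access(path, F_OK);   /* the same existence check the server makes */
        clock_gettime(CLOCK_MONOTONIC, &end);

        return (end.tv_sec - start.tv_sec) * 1e3 +
               (end.tv_nsec - start.tv_nsec) / 1e6;   /* milliseconds */
    }

    int main(int argc, char **argv)
    {
        const char *panfs_path = (argc > 1) ? argv[1] : "/panfs/vol/testfile";
        const char *local_path = (argc > 2) ? argv[2] : "/tmp/testfile";
        const int iters = 1000;

        printf("PanFS path: %8.3f ms for %d access() calls\n",
               time_access_ms(panfs_path, iters), iters);
        printf("local path: %8.3f ms for %d access() calls\n",
               time_access_ms(local_path, iters), iters);
        return 0;
    }

The test files do not need to exist; the point is just to compare how long the
access() check itself takes on PanFS versus local disk, and any large gap should
show up directly in the two totals.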

Let me discuss my findings with the team and get back with you on this.

BTW, I wasn't able to restart your servers using the crm command. Can
you see what's going on with that?

Thanks so much for your time and patience!

Becky


On Mon, Jun 3, 2013 at 11:43 AM, Becky Ligon <[email protected]> wrote:

    Let me run my last tests on orangefs01-ib0 to see if it is really
    the kernel or not.

    Becky


    On Mon, Jun 3, 2013 at 11:30 AM, Michael Robbert <[email protected]> wrote:

        I misspoke slightly in that last email. I think the kernel versions
        we're tied to are 2.6.18.*, not just -308. We're still running
        2.6.18-275.12.1.el5.573g0000 on our other system, so we can try that
        if you'd like.

        Thanks,
        Mike


        On 6/3/13 9:16 AM, Michael Robbert wrote:

            We are confined to kernels from Scyld Clusterware in the
            2.6.18-308.* range. Our PanFS modules were purchased as a one-time
            deal to get it to work with Scyld 5.x. They put in some work to
            make it version-number independent, but I've tried non-Scyld
            kernels and other versions of Scyld and it doesn't work.

            Mike

            On 6/2/13 8:50 PM, Becky Ligon wrote:

                All:

                The area of the code where we thought more time was being
                spent than seemed reasonable was in the metafile dspace create
                and the local datafile dspace create contained in the create
                state machine.  In both of these operations, the code executes
                a function called dbpf_dspace_create_store_handle, which does
                the following (sketched below):

                1.  db->get against BDB to see if the new handle already has a
                    dspace entry...which it shouldn't and doesn't.
                2.  Issue a system call to "access", which tells us if the
                    bstream file for the given handle already exists...which
                    it doesn't.
                3.  db->put against BDB to store the dspace entry for the new
                    handle.
                4.  Insert into the attribute cache.
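
                A rough illustrative sketch of that sequence, assuming plain
                Berkeley DB and POSIX calls; the function and struct names
                here (create_store_handle, dspace_attr) are simplified
                stand-ins, not the actual OrangeFS source:

                    #include <string.h>
                    #include <stdint.h>
                    #include <unistd.h>
                    #include <db.h>          /* Berkeley DB */

                    /* stand-in for the real dspace attributes */
                    struct dspace_attr { uint64_t handle; int type; };

                    int create_store_handle(DB *db, uint64_t handle,
                                            const char *bstream_path)
                    {
                        DBT key, data;
                        struct dspace_attr attr = { handle, 0 };

                        memset(&key, 0, sizeof(key));
                        memset(&data, 0, sizeof(data));
                        key.data = &handle;
                        key.size = sizeof(handle);

                        /* 1. db->get: the new handle should have no dspace entry yet */
                        if (db->get(db, NULL, &key, &data, 0) != DB_NOTFOUND)
                            return -1;

                        /* 2. access(): the bstream file for this handle should not
                         *    exist yet; this is the call that stalls when the
                         *    storage space sits on PanFS */
                        if (access(bstream_path, F_OK) == 0)
                            return -1;

                        /* 3. db->put: store the dspace entry for the new handle */
                        data.data = &attr;
                        data.size = sizeof(attr);
                        if (db->put(db, NULL, &key, &data, 0) != 0)
                            return -1;

                        /* 4. attribute-cache insert is OrangeFS-internal; omitted */
                        return 0;
                    }

                In this sketch, step 2 is the only direct POSIX call against
                the storage directory, which lines up with the timings
                described below.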


                In reviewing a more detailed debug log of these functions, I
                discovered that most of the time these four operations execute
                in less than 0.5 ms.  When the time is greater than that, the
                culprit is always the "access" call alone, or the "access"
                call along with interrupts from the job_timer state machine.

                At this point, I am thinking that there may be a problem with
                the version of Linux running on the machines.  As noted in my
                previous email, 2.6.18-308.16.1.el5 is known to have issues
                with the kernel dcache mechanism, which leads me to believe
                there could be other issues as well.

                In the morning, I will run the same tests on a newer kernel
                (RHEL 6.3) and compare "access" times between the two kernels.

                Becky






                On Fri, May 31, 2013 at 7:22 PM, Becky Ligon
                <[email protected]> wrote:

                     Thanks, Mike!

                     I ran some more tests hoping that the null-aio trove
                     method would eliminate disk issues, but null-aio, as I
                     just discovered, still allows files to be created. Doh!
                     So, I will be looking in more depth at our file creation
                     process, which includes metadata updates and file
                     creation on the disk.

                     BTW: I noticed that you are running
                     2.6.18-308.16.1.el5.584g0000 on your servers, and there
                     is a known Linux bug concerning dcache processing that
                     causes a kernel panic when OrangeFS is unmounted.  This
                     bug affects other software, too, not just ours.  Have you
                     had any problems along these lines?  Our recommendation
                     for those who want to stay on RHEL 5 is to use 2.6.18-308.

                     Becky



                     On Fri, May 31, 2013 at 6:33 PM, Michael Robbert
                     <[email protected]> wrote:

                         Yes, please do. You have free rein on the nodes that
                         I listed in my email to you until this problem is
                         solved.

                         Thanks,
                         Mike


                         On 5/31/13 4:23 PM, Becky Ligon wrote:

                             Mike:

                             Thanks for letting us onto your system.

                              We ran some more tests and it seems that file
                              creation during the touch command is taking more
                              time than it should, while metadata ops seem
                              okay.  I dumped some more OFS debug data and will
                              be looking at it over the weekend.  I want to
                              pinpoint the precise places in the code that I
                              *think* are taking time and then rerun more
                              tests.  This may mean putting up a new copy of
                              OFS with more specific debugging in it, if that
                              is okay with you.  I also have more ideas on
                              other tests that we can run to verify where the
                              problem is occurring.

                              Is it okay if I log onto your system over the
                              weekend?

                             Becky


                              On Fri, May 31, 2013 at 3:24 PM, Becky Ligon
                              <[email protected]> wrote:

                                  Mike:

                                   From the data you just sent, we see spikes
                                   in the touches as well as the removes, with
                                   the removes being more frequent.

                                   For example, on the rm data, there is a
                                   spike of about 2 orders of magnitude (100x)
                                   about every 10 ops, which can result in a
                                   10x average slowdown even though most of the
                                   operations finish quite quickly (for
                                   instance, nine ops at 1 ms plus one at
                                   100 ms averages to about 11 ms, roughly
                                   10x).  We do not normally see this, and we
                                   don't see it on our systems here, so we are
                                   trying to decide what might cause this so we
                                   can direct our efforts.

                                   At this point, we are trying to further
                                   diagnose the problem.  Would it be possible
                                   for us to log onto your system to look
                                   around and possibly run some more tests?

                                   I am sorry for the inconvenience this is
                                   causing, but rest assured, several of us
                                   developers are trying to figure out the
                                   difference in performance between your
                                   system and ours.  (We haven't been able to
                                   recreate your problem as of yet.)


                                  Becky



                                   On Fri, May 31, 2013 at 2:34 PM, Michael
                                   Robbert <[email protected]> wrote:

                                       My terminal buffers weren't big enough
                                       to copy and paste all of that output,
                                       but hopefully the attached will have
                                       enough info for you to get an idea of
                                       what I'm seeing.

                                       I am beginning to feel like we're just
                                       running around in circles here. I can
                                       do these kinds of tests with and
                                       without cache until I'm blue in the
                                       face, but nothing is going to change
                                       until we figure out why uncached
                                       metadata access is so slow. What are we
                                       doing to track that down?

                                      Thanks,
                                      Mike


                                       On 5/31/13 12:05 PM, Becky Ligon wrote:

                                          Mike:

                                           There is something going on with
                                           your system, as I am able to touch
                                           500 files in 12.5 seconds and
                                           delete them in 8.8 seconds on our
                                           cluster.

                                           Did you remove all of the ATTR
                                           entries from your conf file and
                                           restart the servers?

                                           If not, please do so, then capture
                                           the output from the following and
                                           send it to me:

                                           for i in `seq 1 500`; do time touch myfile${i}; done

                                           and then

                                           for i in myfile*; do time rm -f ${i}; done


                                          Thanks,
                                          Becky


                                           On Fri, May 31, 2013 at 12:02 PM,
                                           Michael Robbert <[email protected]> wrote:

                                                top - 09:54:53 up 6 days, 19:11,  1 user,  load average: 0.00, 0.00, 0.00
                                                Tasks: 156 total,   1 running, 155 sleeping,   0 stopped,   0 zombie
                                                Cpu(s):  0.1%us,  0.2%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
                                                Mem:  12289220k total,   1322196k used,  10967024k free,     85820k buffers
                                                Swap:  2104432k total,       232k used,   2104200k free,    965636k cached

                                                They all look very similar to
                                                this. 232k swap used on all of
                                                them throughout a touch/rm of
                                                100 files. Ganglia doesn't
                                                show any change over time with
                                                cache on or off.

                                               Mike


                                                On 5/31/13 9:30 AM, Becky Ligon wrote:

                                                   Michael:

                                                    Can you send me a screen
                                                    shot of "top" from your
                                                    servers when the metadata
                                                    is running on the local
                                                    disk?  I'd like to see how
                                                    much memory is available.
                                                    I'm wondering if 1GB for
                                                    your DB cache is too high,
                                                    possibly causing excessive
                                                    swapping.

                                                   Becky


                                                    On Fri, May 24, 2013 at 6:06 PM, Michael Robbert
                                                    <[email protected]> wrote:

                                                         We recently noticed a performance problem with our
                                                         OrangeFS server.

                                                         Here are the server stats:
                                                         3 servers, built identically with identical hardware

                                                         [root@orangefs02 ~]# /usr/sbin/pvfs2-server --version
                                                         2.8.7-orangefs (mode: aio-threaded)

                                                         [root@orangefs02 ~]# uname -r
                                                         2.6.18-308.16.1.el5.584g0000

                                                         4 core E5603 1.60GHz
                                                         12GB of RAM

                                                         OrangeFS is being served to clients using bmi_tcp
                                                         over DDR Infiniband.  Backend storage is PanFS with
                                                         2x10Gig connections on the servers.  Performance to
                                                         the backend looks fine using bonnie++: >100MB/sec
                                                         write and ~250MB/s read to each stack, ~300
                                                         creates/sec.

                                                         The OrangeFS clients are running kernel version
                                                         2.6.18-238.19.1.el5.

                                                         The biggest problem I have right now is that deletes
                                                         are taking a long time.  Almost 1 sec per file.

                                                         [root@fatcompute-11-32 L_10_V0.2_eta0.3_wRes_truncerr1e-11]# find N2/|wc -l
                                                         137
                                                         [root@fatcompute-11-32 L_10_V0.2_eta0.3_wRes_truncerr1e-11]# time rm -rf N2

                                                         real    1m31.096s
                                                         user    0m0.000s
                                                         sys     0m0.015s

                                                         Similar results for file creates:

                                                         [root@fatcompute-11-32 ]# date;for i in `seq 1 50`;do touch file${i};done;date
                                                         Fri May 24 16:04:17 MDT 2013
                                                         Fri May 24 16:05:05 MDT 2013

                                                         What else do you need to know?  Which debug flags?
                                                         What should we be looking at?  I don't see any load
                                                         on the servers, and I've restarted the servers and
                                                         rebooted the server nodes.

                                                         Thanks for any pointers,
                                                         Mike Robbert
                                                         Colorado School of Mines









                                                   --
                                                   Becky Ligon
                                                   OrangeFS Support and
                Development
                                                   Omnibond Systems
                                                   Anderson, South Carolina





                                          --
                                          Becky Ligon
                                          OrangeFS Support and Development
                                          Omnibond Systems
                                          Anderson, South Carolina




                                  --
                                  Becky Ligon
                                  OrangeFS Support and Development
                                  Omnibond Systems
                                  Anderson, South Carolina




                             --
                             Becky Ligon
                             OrangeFS Support and Development
                             Omnibond Systems
                             Anderson, South Carolina





                     --
                     Becky Ligon
                     OrangeFS Support and Development
                     Omnibond Systems
                     Anderson, South Carolina




                --
                Becky Ligon
                OrangeFS Support and Development
                Omnibond Systems
                Anderson, South Carolina











    --
    Becky Ligon
    OrangeFS Support and Development
    Omnibond Systems
    Anderson, South Carolina




--
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina



_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
