All:

The area of the code where more time seemed to be spent than was reasonable
is the metafile dspace create and the local datafile dspace create in the
create state machine.  In both of these operations, the code executes a
function called dbpf_dspace_create_store_handle, which does the following
(a rough sketch follows the list):

1.  db->get against BDB to see whether the new handle already has a dspace
    entry (it shouldn't, and it doesn't).
2.  An "access" system call to check whether the bstream file for the
    given handle already exists (it doesn't).
3.  db->put against BDB to store the dspace entry for the new handle.
4.  An insert into the attribute cache.
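
For reference, here is a rough sketch of that sequence in C.  This is not
the actual dbpf_dspace_create_store_handle code -- the DB handle, bstream
path, and attribute arguments are simplified placeholders, and error
handling is omitted:

#include <stdint.h>
#include <string.h>
#include <unistd.h>    /* access() */
#include <db.h>        /* Berkeley DB */

static int store_handle_sketch(DB *dspace_db, uint64_t handle,
                               const char *bstream_path,
                               void *attr, uint32_t attr_len)
{
    DBT key, data;
    int ret;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = &handle;
    key.size = sizeof(handle);

    /* 1. db->get: verify that no dspace entry exists for this handle yet. */
    ret = dspace_db->get(dspace_db, NULL, &key, &data, 0);
    if (ret != DB_NOTFOUND)
        return -1;                /* handle already in use, or a DB error */

    /* 2. access(): verify that the bstream file does not already exist.
     *    This is the call that shows up as the slow step in the logs. */
    if (access(bstream_path, F_OK) == 0)
        return -1;                /* stale bstream file */

    /* 3. db->put: store the dspace entry for the new handle. */
    data.data = attr;
    data.size = attr_len;
    ret = dspace_db->put(dspace_db, NULL, &key, &data, 0);
    if (ret != 0)
        return -1;

    /* 4. Insert into the attribute cache (cache API elided here). */
    return 0;
}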


In reviewing a more detailed debug log of these functions, I discovered
that most of the time these four operations complete in less than 0.5 ms.
When they take longer, the culprit is always the "access" call, either
alone or together with interrupts from the job_timer state machine.

At this point, I am thinking that there may be a problem with the version
of Linux running on the machines.  As noted in my previous email,
2.6.18-308.16.1.el5 is known to have issues with the kernel dcache
mechanism, which leads me to believe there could be other issues as well.

In the morning, I will run the same tests on a newer kernel (RHEL 6.3) and
compare "access" times between the two kernels.
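
For the comparison, I plan to use something along these lines -- just a
minimal timing harness for the "access" call, not an official test; the
file name (access_time.c) and default path are placeholders:

/* access_time.c: time repeated access() calls against a path that does
 * not exist, which mirrors the check the server makes for a new handle.
 * Build: gcc -O2 -o access_time access_time.c -lrt
 * (librt is needed for clock_gettime on older glibc, e.g. RHEL 5) */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/tmp/does-not-exist";
    struct timespec t0, t1;
    double ms, total = 0.0, worst = 0.0;
    int i, iters = 10000;

    for (i = 0; i < iters; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        access(path, F_OK);       /* the same kind of existence check */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
             (t1.tv_nsec - t0.tv_nsec) / 1e6;
        total += ms;
        if (ms > worst)
            worst = ms;
    }
    printf("access(%s): avg %.4f ms, worst %.4f ms over %d calls\n",
           path, total / iters, worst, iters);
    return 0;
}

Running the same binary on both kernels should show whether the outliers
follow the kernel.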

Becky






On Fri, May 31, 2013 at 7:22 PM, Becky Ligon <[email protected]> wrote:

> Thanks, Mike!
>
> I ran some more tests hoping that the null-aio trove method would
> eliminate disk issues, but null-aio, as I just discovered, still allows
> files to be created. Doh!  So, I will be looking in more depth at our file
> creation process, which includes metadata updates and file creation on the
> disk.
>
> BTW:  I noticed that you are running 2.6.18-308.16.1.el5.584g0000
> on your servers and there is a known Linux bug concerning dcache
> processing that creates a kernel panic when OrangeFS is unmounted.  This
> bug affects other software, too, not just ours.  Have you had any problems
> along these lines?  Our recommendation for those who want to stay on RHEL 5
> is to use 2.6.18-308.
>
> Becky
>
>
>
> On Fri, May 31, 2013 at 6:33 PM, Michael Robbert <[email protected]> wrote:
>
>> Yes, please do. You have free rein on the nodes that I listed in my
>> Email to you until this problem is solved.
>>
>> Thanks,
>> Mike
>>
>>
>> On 5/31/13 4:23 PM, Becky Ligon wrote:
>>
>>> Mike:
>>>
>>> Thanks for letting us onto your system.
>>>
>>> We ran some more tests and it seems that file creation during the touch
>>> command is taking more time than it should, while metadata ops seem
>>> okay.   I dumped some more OFS debug data and will be looking at it over
>>> the weekend.  I want to pinpoint the precise places in the code that I
>>> *think* are taking time and then rerun more tests.  This may mean
>>> putting up a new copy of OFS with more specific debugging in it, if that
>>> is okay with you.  I also have more ideas on other tests that we can run
>>> to verify where the problem is occurring.
>>>
>>> Is it okay if I log onto your system over the weekend?
>>>
>>> Becky
>>>
>>>
>>> On Fri, May 31, 2013 at 3:24 PM, Becky Ligon <[email protected]> wrote:
>>>
>>>     Mike:
>>>
>>>      From the data you just sent, we see spikes in the touches as well
>>>     as the removes, with the removes being more frequent.
>>>
>>>     For example, on the rm data, there is a spike of about 2 orders of
>>>     magnitude (100x) about every 10 ops, which can result in a 10x
>>>     average slowdown, even though most of the operations finish quite
>>>     quickly.  We do not normally see this, and we don't see it on our
>>>     systems here, so we are trying to decide what might cause this so we
>>>     can direct our efforts.
>>>
>>>     At this point, we are trying to further diagnose the problem.  Would
>>>     it be possible for us to log onto your system to look around and
>>>     possibly run some more tests?
>>>
>>>     I am sorry for the inconvenience this is causing, but rest assured,
>>>     several of us developers are trying to figure out the difference in
>>>     performance between your system and ours.  (We haven't been able to
>>>     recreate your problem as of yet.)
>>>
>>>
>>>     Becky
>>>
>>>
>>>
>>>     On Fri, May 31, 2013 at 2:34 PM, Michael Robbert <[email protected]> wrote:
>>>
>>>         My terminal buffers weren't big enough to copy and paste all of
>>>         that output, but hopefully the attached will have enough info
>>>         for you to get an idea of what I'm seeing.
>>>         I am beginning to feel like we're just running around in circles
>>>         here. I can do these kinds of tests with and without cache until
>>>         I'm blue in the face, but nothing is going to change until we
>>>         figure out why uncached metadata access is so slow. What are
>>>         we doing to track that down?
>>>
>>>         Thanks,
>>>         Mike
>>>
>>>
>>>         On 5/31/13 12:05 PM, Becky Ligon wrote:
>>>
>>>             Mike:
>>>
>>>             There is something going on with your system, as I am able
>>>             to touch 500
>>>             files in 12.5 seconds and delete them in 8.8 seconds on our
>>>             cluster.
>>>
>>>             Did you remove all of the ATTR entries from your conf file and
>>>             restart the
>>>             servers?
>>>
>>>             If not, please do so and then capture the output from the
>>>             following and
>>>             send it to me:
>>>
>>>             for i in `seq 1 500`; do time touch myfile${i}; done
>>>
>>>             and then
>>>
>>>             for i in myfile*; do time rm -f ${i}; done
>>>
>>>
>>>             Thanks,
>>>             Becky
>>>
>>>
>>>             On Fri, May 31, 2013 at 12:02 PM, Michael Robbert
>>>             <[email protected]> wrote:
>>>
>>>                  top - 09:54:53 up 6 days, 19:11,  1 user,  load
>>>             average: 0.00, 0.00,
>>>                  0.00
>>>                  Tasks: 156 total,   1 running, 155 sleeping,   0
>>>             stopped,   0 zombie
>>>                  Cpu(s):  0.1%us,  0.2%sy,  0.0%ni, 99.8%id,  0.0%wa,
>>>               0.0%hi,
>>>                    0.0%si, 0.0%st
>>>                  Mem:  12289220k total,  1322196k used, 10967024k free,
>>>                 85820k buffers
>>>                  Swap:  2104432k total,      232k used,  2104200k free,
>>>                965636k cached
>>>
>>>                  They all look very similar to this. 232k swap used on
>>>             all of them
>>>                  throughout a touch/rm of 100 files. Ganglia doesn't
>>>             show any change
>>>                  over time with cache on or off.
>>>
>>>                  Mike
>>>
>>>
>>>                  On 5/31/13 9:30 AM, Becky Ligon wrote:
>>>
>>>                      Michael:
>>>
>>>                      Can you send me a screen shot of "top" from your
>>>             servers when the
>>>                      metadata is running on the local disk?  I'd like to
>>>             see how much
>>>                      memory
>>>                      is available.  I'm wondering if 1GB for your DB
>>>             cache is too high,
>>>                      possibly causing excessive swapping.
>>>
>>>                      Becky
>>>
>>>
>>>                      On Fri, May 24, 2013 at 6:06 PM, Michael Robbert
>>>                      <[email protected]> wrote:
>>>
>>>                           We recently noticed a performance problem with
>>>             our OrangeFS
>>>                      server.
>>>
>>>                           Here are the server stats:
>>>                           3 servers, built identically with identical
>>>             hardware
>>>
>>>                           [root@orangefs02 ~]# /usr/sbin/pvfs2-server
>>>             --version
>>>                           2.8.7-orangefs (mode: aio-threaded)
>>>
>>>                           [root@orangefs02 ~]# uname -r
>>>                           2.6.18-308.16.1.el5.584g0000
>>>
>>>                           4 core E5603 1.60GHz
>>>                           12GB of RAM
>>>
>>>                           OrangeFS is being served to clients using
>>>             bmi_tcp over DDR
>>>                      Infiniband.
>>>                           Backend storage is PanFS with 2x10Gig
>>>             connections on the
>>>                      servers.
>>>                           Performance to the backend looks fine using
>>>             bonnie++.
>>>                       >100MB/sec
>>>                           write and ~250MB/s read to each stack. ~300
>>>             creates/sec.
>>>
>>>                           The OrangeFS clients are running kernel version
>>>                           2.6.18-238.19.1.el5.
>>>
>>>                           The biggest problem I have right now is that
>>>                           deletes are taking a long time.  Almost 1 sec
>>>                           per file.
>>>
>>>                           [root@fatcompute-11-32
>>>                      L_10_V0.2_eta0.3_wRes_______truncerr1e-11]# find
>>>                           N2/|wc -l
>>>                           137
>>>                           [root@fatcompute-11-32
>>>                      L_10_V0.2_eta0.3_wRes_______truncerr1e-11]# time
>>>                           rm -rf N2
>>>
>>>                           real    1m31.096s
>>>                           user    0m0.000s
>>>                           sys     0m0.015s
>>>
>>>                           Similar results for file creates:
>>>
>>>                           [root@fatcompute-11-32 ]# date;for i in `seq 1
>>>             50`;do touch
>>>                           file${i};done;date
>>>                           Fri May 24 16:04:17 MDT 2013
>>>                           Fri May 24 16:05:05 MDT 2013
>>>
>>>                           What else do you need to know? Which debug
>>>             flags? What
>>>                      should we be
>>>                           looking at?
>>>                           I don't see any load on the servers and I've
>>>             restarted
>>>                      server and
>>>                           rebooted server nodes.
>>>
>>>                           Thanks for any pointers,
>>>                           Mike Robbert
>>>                           Colorado School of Mines
>>>
>>>
>>>
>>>                           _______________________________________________
>>>                           Pvfs2-users mailing list
>>>                           [email protected]
>>>                           http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
>>>
>>>
>>>
>>>
>>>                      --
>>>                      Becky Ligon
>>>                      OrangeFS Support and Development
>>>                      Omnibond Systems
>>>                      Anderson, South Carolina
>>>
>>>
>>>
>>>
>>>
>>>             --
>>>             Becky Ligon
>>>             OrangeFS Support and Development
>>>             Omnibond Systems
>>>             Anderson, South Carolina
>>>
>>>
>>>
>>>
>>>     --
>>>     Becky Ligon
>>>     OrangeFS Support and Development
>>>     Omnibond Systems
>>>     Anderson, South Carolina
>>>
>>>
>>>
>>>
>>> --
>>> Becky Ligon
>>> OrangeFS Support and Development
>>> Omnibond Systems
>>> Anderson, South Carolina
>>>
>>>
>>
>
>
> --
> Becky Ligon
> OrangeFS Support and Development
> Omnibond Systems
> Anderson, South Carolina
>
>


-- 
Becky Ligon
OrangeFS Support and Development
Omnibond Systems
Anderson, South Carolina
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
