Hello Lydia,

Thank you for your hint about using the "in" intent. Unfortunately it
didn't improve the performance, but it is still good to know about it.

In another mail, Ben told me a reason for the issue: there was an
array assignment that implicitly fires a lot of forall loops, causing
bad performance. After fixing it, the parallel and serial executions
take around the same time.
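
To illustrate the kind of statement I mean (a minimal sketch with
made-up names, not the actual code):

var a, b: [1..1000000] real;

// In Chapel a whole-array assignment is itself a data-parallel
// operation, so putting a statement like this inside the per-step
// loop implicitly launches forall-style work on every iteration:
a = b;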

Maybe there is a way to make the parallel Barnes-Hut run roughly
[number of processors] times faster than the serial version; with the
simple algorithm this is already the case.
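
For reference, this is roughly how I compare the serial and parallel
runs (a simplified sketch, not copied from my code; the config name is
made up):

use Time;

config const runSerial = false;  // set with --runSerial=true on the command line

var t: Timer;
t.start();
serial runSerial {
  // ... the simulation steps; any forall loops in here execute
  //     sequentially when runSerial is true ...
}
t.stop();
writeln("elapsed time: ", t.elapsed(), " s");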

I uploaded the updated code of my implementation, either packed [1] or
browsable as a directory [2].

By the way: if you are actually wondering why the program is called
"fish", I don't know either. ;)

bye

[1] http://www-user.tu-chemnitz.de/~michd/fish.tar.gz
[2] http://www-user.tu-chemnitz.de/~michd/fish/


Quoting Lydia Duncan <[email protected]>:

> Hello,
>
> It sounds like you want a copy of the data rather than a pure
> reference to it for the individual calls, so perhaps what you are
> looking for are different intents on the arguments to the
> calculation function or its children?  If the function/task calling
> the function should work on its own copy of the data and no other
> calls will use that data, then you may want to give the argument an
> "in" intent, which copies the data into the function; any changes
> made inside the function are discarded when it exits.  If you expect
> other calls to use the data afterwards, but not until everything has
> completed, consider an "inout" intent, which makes an independent
> copy of the data for use in the function and then copies its final
> state back to the caller's copy of the data.  Note that if these
> intents are not used carefully, unnecessary or repeated copies may
> be made, making your code slower than before!
>
> An example use of the "in" intent on a simple value:
>
> var foo = 3;
> writeln(foo);
> useMe(foo);
> writeln(foo);
>
> proc useMe(in arg: int) {
>     arg = arg + 3;
>     writeln(arg);
>     return;
> }
>
> Running that simple program will generate:
>
> 3
> 6
> 3
>
> as output.  Modifying the intent so that the function header reads:
>
> proc useMe(inout arg: int) {
>
> will instead generate:
>
> 3
> 6
> 6
>
> Note that using ref (as you had previously done) instead of inout
> will produce the same output, but may carry a performance penalty or
> improvement depending on the situation.  Because you are using
> classes, specifying ref should not be necessary (as Ben said earlier
> in this conversation).
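>
> As a quick illustration (the class and procedure names here are just
> made up), a class instance's fields can be modified through a formal
> with the default intent, since class arguments are passed by
> reference:
>
> class Counter { var x: int; }
>
> proc bump(c: Counter) {
>   c.x += 1;        // updates the caller's instance, no ref needed
> }
>
> var c = new Counter(5);
> bump(c);
> writeln(c.x);      // prints 6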
>
> Hope that helps!
> Lydia
>
> On 08/27/2014 07:40 AM, Michael Dietrich wrote:
>> Hello,
>>
>> I have a theory about the misbehavior in my program:
>>
>> In the simple algorithm the tasks access every particle's data once
>> per step, but mostly for reading, so there is no possible conflict.
>> The writing happens only at the end of every step, independently of
>> the other particles. No problems can occur, so it works as expected.
>>
>> However, the Barnes-Hut algorithm is based on a tree structure,
>> which is always initialized and assembled by the first task. So the
>> flaw must be in the forall loop where the tree is used to calculate
>> the force. In the MPI solution every process holds its own tree,
>> but in Chapel, with several tasks on one computer all using the
>> same tree, it may be inevitable that the algorithm becomes slower
>> instead of faster.
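>>
>> Schematically (the names here are only illustrative, not the real
>> ones from my code), the loop in question looks like:
>>
>> forall part in field.p {
>>   // every task traverses the same tree built by the first task
>>   calc_force_from_tree(tree, part);
>> }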
>>
>> Are these thoughts going in the right direction? How could this
>> issue be dealt with?
>>
>> bye
>>
>>
>> Quoting Michael Dietrich <[email protected]>:
>>
>>> Hi,
>>>
>>> Thank you for your hint. I'm aware of that, but the calc_force
>>> routine is executed in every program step, so this part does not
>>> (only) initialize the values but also resets them.
>>>
>>> Tomorrow I will try the zippered iteration... please note that in
>>> my time zone it is nine hours later. ;)
>>>
>>> Today I tried to run Chapel on more than one locale. With a simple
>>> algorithm it worked fine, but with my particle simulation... let's
>>> find a solution for the first problem I contacted you about before
>>> I get to the next one. :)
>>>
>>> bye
>>>
>>>
>>> Quoting Lydia Duncan <[email protected]>:
>>>
>>>> Hi Michael,
>>>>
>>>> One thing to note - Chapel variables are initialized by default.
>>>> Since you provided a type for the fields of a particle, you don't
>>>> need to set their values to 0.0 as that is done automatically.
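>>>>
>>>> For example (the field names here are only illustrative):
>>>>
>>>> record Force {
>>>>   var x, y, z: real;  // each field starts out as 0.0 automatically
>>>> }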
>>>>
>>>> Lydia
>>>>
>>>> On 08/20/2014 09:58 AM, Ben Harshbarger wrote:
>>>>> Hi Michael,
>>>>>
>>>>> I think you’re going to want a zippered iteration:
>>>>>
>>>>> forall (part, i) in zip(field.p, field.D) { … }
>>>>>
>>>>> -Ben
>>>>>
>>>>> On 8/20/14, 7:38 AM, "Michael Dietrich"
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Yes, you're right, I was still using the C syntax. Now it looks
>>>>>> much better.
>>>>>>
>>>>>> One question to make it perfect: if I write...
>>>>>>
>>>>>> forall part in field.p {
>>>>>>  part.f.x = 0.0;
>>>>>>  part.f.y = 0.0;
>>>>>>  part.f.z = 0.0;
>>>>>> }
>>>>>>
>>>>>> ...how can I get the index of "part" within the array "p"?
>>>>>>
>>>>>> bye
>>>>>>
>>>>>>
>>>>>> Zitat von Ben Harshbarger <[email protected]>:
>>>>>>
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> Looking through your code I saw that you were passing classes as
>>>>>>> arguments with the ‘ref’ intent. Chapel classes are passed as
>>>>>>> references by default, so I don’t think you need the ref intent in
>>>>>>> most of those functions:
>>>>>>>
>>>>>>> // this should be fine
>>>>>>> proc calc_force(parts : ParticleField) { … }
>>>>>>>
>>>>>>> Something else you can do to improve your code’s performance and
>>>>>>> readability is to iterate directly over an array instead of its
>>>>>>> underlying domain:
>>>>>>>
>>>>>>> forall part in field.p {
>>>>>>>  part.f.x = 0.0;
>>>>>>>  part.f.y = 0.0;
>>>>>>>  part.f.z = 0.0;
>>>>>>> }
>>>>>>>
>>>>>>> // instead of
>>>>>>>
>>>>>>> forall i in 0..field.num_parts-1 {
>>>>>>>  field.p[i].f.x = 0.0;
>>>>>>>  field.p[i].f.y = 0.0;
>>>>>>>  field.p[i].f.z = 0.0;
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Let me know if you have any additional questions.
>>>>>>>
>>>>>>> -Ben
>>>>>>>
>>>>>>>
>>>>>>> On 8/18/14, 11:18 AM, "Michael Dietrich"
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Okay, if you want to, you can read through the code [1]. :)
>>>>>>>> For comparison it includes the C program, which I successfully
>>>>>>>> parallelized with MPI, and the same program in Chapel.
>>>>>>>> I hope I didn't forget to comment the important lines. If anything
>>>>>>>> is unclear, just ask me.
>>>>>>>>
>>>>>>>> Please keep in mind that the code won't win a beauty contest
>>>>>>>> (especially the C version), but at least my Chapel code is more
>>>>>>>> readable than the C code, which was mostly written by the
>>>>>>>> responsible employee. ;)
>>>>>>>>
>>>>>>>> bye
>>>>>>>>
>>>>>>>> [1] http://www-user.tu-chemnitz.de/~michd/fish.tar.gz
>>>>>>>>
>>>>>>>>
>>>>>>>> Quoting Lydia Duncan <[email protected]>:
>>>>>>>>
>>>>>>>>> Hi Michael,
>>>>>>>>>
>>>>>>>>> I think it will be difficult for us to offer further suggestions
>>>>>>>>> without looking at the code ourselves. Would you be comfortable
>>>>>>>>> sending it?
>>>>>>>>>
>>>>>>>>> Lydia
>>>>>>>>>
>>>>>>>>> On 08/18/2014 05:27 AM, Michael Dietrich wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Today I compiled my program with the --fast flag.
>>>>>>>>>> Though it made both executions much faster than before, the
>>>>>>>>>> parallel one became even slower relative to the serial one. I'm
>>>>>>>>>> still trying to find possible mistakes in my code, but I can't
>>>>>>>>>> find any. Any suggestions?
>>>>>>>>>>
>>>>>>>>>> bye
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Quoting Tom MacDonald <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>
>>>>>>>>>>> Without --fast the compiler generates runtime checks, which slows
>>>>>>>>>>> down execution time.
>>>>>>>>>>>
>>>>>>>>>>> Please compile with the --fast flag and see how much execution
>>>>>>>>>>> time improvement there is.
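>>>>>>>>>>>
>>>>>>>>>>> For example (assuming the source file is named fish.chpl):
>>>>>>>>>>>
>>>>>>>>>>>   chpl --fast fish.chpl -o fish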
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> Tom MacDonald
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 15 Aug 2014, Tom MacDonald wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for your interest in Chapel!  It's good to hear you are
>>>>>>>>>>>> studying high performance programming languages.
>>>>>>>>>>>>
>>>>>>>>>>>> Are you compiling your Chapel program with the --fast option?
>>>>>>>>>>>>
>>>>>>>>>>>> We need to know that before looking deeper.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>> Tom MacDonald
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 15 Aug 2014, Michael Dietrich wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm working with Chapel for my bachelor thesis on high
>>>>>>>>>>>>> performance programming languages. My current task is to
>>>>>>>>>>>>> implement a particle simulation program for which I already
>>>>>>>>>>>>> have the code in C. It includes two possible algorithms for
>>>>>>>>>>>>> calculating the force acting on the particles: a simple but
>>>>>>>>>>>>> inefficient one and the Barnes-Hut algorithm [1], which is
>>>>>>>>>>>>> much faster but a bit more complicated. The other
>>>>>>>>>>>>> calculations aren't that complex, so only the force
>>>>>>>>>>>>> calculation matters to me.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I implemented the simple algorithm first. To compare the
>>>>>>>>>>>>> serial and parallel execution times I surrounded everything
>>>>>>>>>>>>> with a serial statement that evaluates a bool variable set
>>>>>>>>>>>>> on the command line. I haven't implemented the multi-locale
>>>>>>>>>>>>> version yet, so it runs only on a dual-core PC using forall
>>>>>>>>>>>>> loops. In the end the parallel run needed only half the time
>>>>>>>>>>>>> of the serial one, yay.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I continued with Barnes-Hut. This one was a bit more work
>>>>>>>>>>>>> because maintaining the tree structure leaves a lot of room
>>>>>>>>>>>>> for mistakes, but after some more time it was working as
>>>>>>>>>>>>> well. My issue is the parallel execution time of this
>>>>>>>>>>>>> algorithm. As in the other one, I replaced the crucial for
>>>>>>>>>>>>> loop with a forall loop (the serial statement surrounds the
>>>>>>>>>>>>> whole program). The problem is that the parallel execution
>>>>>>>>>>>>> time is similar to the serial one, sometimes even longer.
>>>>>>>>>>>>> Of course I don't want you to read through all my code, but
>>>>>>>>>>>>> could you tell me some possible reasons why this effect may
>>>>>>>>>>>>> occur?
>>>>>>>>>>>>>
>>>>>>>>>>>>> thank you very much
>>>>>>>>>>>>> bye
>>>>>>>>>>>>> Michael
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] http://beltoforion.de/barnes_hut/barnes_hut_en.html