Hello,
It sounds like you want a copy of the data rather than a reference to it
for the individual calls, so perhaps what you are looking for are
different intents on the arguments to the calculation function or its
children. If the function/task calling the function should work on its
own copy of the data and no other calls will use that data, then you may
want to give the argument an "in" intent, which copies the data into the
function; any changes made to it inside the function are not visible to
the caller once it returns. If you expect other calls to use the data
afterwards, but not until everything has completed, consider an "inout"
intent, which makes an independent copy of the data for use in the
function and then copies its final state back to the caller's copy when
the function returns. Note that if these intents are not used carefully,
unnecessary copies may be made, or made repeatedly, making your code
slower than before!
An example use of the "in" intent on a simple value:
var foo = 3;
writeln(foo);
useMe(foo);
writeln(foo);

proc useMe(in arg: int) {
  arg = arg + 3;
  writeln(arg);
  return;
}
Running that simple program will generate:
3
6
3
as output. Modifying the intent so that the function header reads:
proc useMe(inout arg: int) {
will instead generate:
3
6
6
Note that using ref (as you had previously done) instead of inout will
produce the same output, but may carry a performance penalty or
improvement depending on the situation. Since what you are passing are
classes, specifying ref should not be necessary (as Ben said earlier in
this conversation).
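For reference, a minimal sketch of the ref version of that header (with
an int it produces the same output as inout here, though it passes a
reference rather than making a copy):

proc useMe(ref arg: int) {
  arg = arg + 3;   // modifies the caller's foo directly
  writeln(arg);    // output for the full program is then: 3, 6, 6
}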
Hope that helps!
Lydia
On 08/27/2014 07:40 AM, Michael Dietrich wrote:
Hello,
I have a theory for the misbehavior in my program:
In the simple algorithm the tasks access every particle's data once per
step, but mostly for reading, so there is no possible conflict. The
writing happens only at the end of every step, independently of the
other particles. Since no conflicts can occur, it works as expected.
The Barnes-Hut algorithm, however, is based on a tree structure, which
is always initialized and assembled by the first task. So the flaw must
be in the forall loop where the tree is used to calculate the force.
In the MPI solution every process holds its own tree, but in Chapel,
with several tasks on one computer all using the same tree, it may be
inevitable that the algorithm becomes slower instead of faster.
Are these thoughts going in the right direction? How might this issue
be dealt with?
bye
Zitat von Michael Dietrich <[email protected]>:
Hi,
Thank you for your hint. I'm aware of that, but the calc_force
algorithm is executed in every program step, so this part does not
(only) initialize the values but also resets them.
Tomorrow I will try the zippered iteration... please note that where I
am, the time is nine hours ahead. ;)
Today I tried to run Chapel on more than one locale. With a simple
algorithm it worked fine, but with my particle simulation... let's find
a solution for the first problem I contacted you with before I get to
the next one. :)
bye
Zitat von Lydia Duncan <[email protected]>:
Hi Michael,
One thing to note - Chapel variables are initialized by default.
Since you provided a type for the fields of a particle, you don't
need to set their values to 0.0 as that is done automatically.
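For example, a minimal sketch (using a hypothetical Particle class, not
your actual declaration):

class Particle {
  var x, y, z: real;   // typed fields are default-initialized to 0.0
}

var p = new Particle();
writeln(p.x, " ", p.y, " ", p.z);   // prints 0.0 0.0 0.0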
Lydia
On 08/20/2014 09:58 AM, Ben Harshbarger wrote:
Hi Michael,
I think you’re going to want a zippered iteration:
forall (part, i) in zip(field.p, field.D) { … }
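For example, a minimal sketch assuming field.D is the domain that
field.p is declared over:

forall (part, i) in zip(field.p, field.D) {
  // 'part' is the element of field.p, 'i' is its index within the domain
  part.f.x = 0.0;
  part.f.y = 0.0;
  part.f.z = 0.0;
}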
-Ben
On 8/20/14, 7:38 AM, "Michael Dietrich"
<[email protected]> wrote:
Hello,
yes, you're right, I continued using the C syntax. Now it looks much
better.
One question about this to make it perfect: if I write...

forall part in field.p {
  part.f.x = 0.0;
  part.f.y = 0.0;
  part.f.z = 0.0;
}

...how can I get the index of "part" within the array "p"?
bye
Zitat von Ben Harshbarger <[email protected]>:
Hi Michael,
Looking through your code I saw that you were passing classes as
arguments with the ‘ref’ intent. Chapel classes are passed as
references by default, so I don’t think you need the ref intent in most
of those functions:
// this should be fine
proc calc_force(parts : ParticleField) { … }
Something else you can do to improve your code’s performance and
readability is to iterate directly over an array instead of its
underlying domain:
forall part in field.p {
  part.f.x = 0.0;
  part.f.y = 0.0;
  part.f.z = 0.0;
}

// instead of

forall i in 0..field.num_parts-1 {
  field.p[i].f.x = 0.0;
  field.p[i].f.y = 0.0;
  field.p[i].f.z = 0.0;
}
Let me know if you have any additional questions.
-Ben
On 8/18/14, 11:18 AM, "Michael Dietrich"
<[email protected]> wrote:
Hello,
okay, if you want to, you can read through the code [1]. :)
For comparison it includes the program in C, which I successfully
parallelized with MPI, and the same program in Chapel.
I hope I didn't forget to comment important lines. If anything is
unclear, just ask me.
Please bear in mind that the code won't win a beauty contest (especially
the one in C), but at least my Chapel code is more readable than the C
code, which was mostly written by the responsible employee. ;)
bye
[1] http://www-user.tu-chemnitz.de/~michd/fish.tar.gz
Zitat von Lydia Duncan <[email protected]>:
Hi Michael,
I think it will be difficult for us to offer further suggestions
without looking at the code ourselves. Would you be comfortable
sending it?
Lydia
On 08/18/2014 05:27 AM, Michael Dietrich wrote:
Hello,
today I compiled my program with the --fast flag.
Though it made both executions much faster than before, the parallel
one became even slower compared with the serial one. I'm still trying
to find any possible mistakes in my code, but I can't find any. Any
suggestions?
bye
Zitat von Tom MacDonald <[email protected]>:
Hi Michael,
Without --fast the compiler generates runtime checks, which slow down
execution.
Please compile with the --fast flag and see how much the execution
time improves.
Thanks
Tom MacDonald
On Fri, 15 Aug 2014, Tom MacDonald wrote:
Hi Michael,
Thank you for your interest in Chapel! It's good to hear you are
studying high performance programming languages.
Are you compiling your Chapel program with the --fast option?
We need to know that before looking deeper.
Thanks
Tom MacDonald
On Fri, 15 Aug 2014, Michael Dietrich wrote:
Hello,
I'm working with Chapel for my bachelor thesis about high performance
programming languages. My current task is to implement a particle
simulation program, for which I already have the code in C. It includes
two possible algorithms for calculating the force acting on the
particles: a simple but inefficient one, and the Barnes-Hut algorithm
[1], which is much faster but a bit more complicated. The other
calculations aren't that complex, so for me only the calculation of the
force is important.
I implemented the simple algorithm first. To compare the serial and
parallel execution times, I surrounded everything with a serial
statement that evaluates a bool variable I set on the command line. I
haven't implemented the multi-locale improvement yet, so it runs only
on a dual-core PC, using forall loops. In the end the parallel version
only needed half the time of the serial one, yay.
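(Roughly, the structure is something like the following minimal sketch,
with hypothetical names rather than the real code:)

config const runSerial = false;   // toggled on the command line with --runSerial=true
var forces: [1..8] real;          // stand-in for the per-particle force data

serial runSerial {                // when true, all parallelism inside runs serially
  forall i in 1..8 do
    forces[i] = i:real;           // stand-in for the force calculation
}
writeln(forces);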
I continued with Barnes-Hut. This one was a bit more work because
maintaining the tree structure leaves a lot of opportunities for
mistakes. After a bit more time it was working as well.
My issue is with the parallel execution time of this algorithm. Like in
the other one, I replaced the crucial for loop with a forall loop (the
serial statement surrounds the whole program). The problem is that the
parallel execution time is similar to the serial one, sometimes even
longer.
Of course I don't want you to read through all my code, but could you
tell me some possible reasons why this effect may occur?
thank you very much
bye
Michael
[1] http://beltoforion.de/barnes_hut/barnes_hut_en.html