Hello,

I would like to share some benchmarks from my dual-core (yes, I am a poor guy) machine. The benchmark measures A+A with A←1048576⍴2. The y-axis shows the CPU cycle counter, which was recorded every 4096 iterations (so we have 256 samples on the x-axis). The first value was 0 (before the loop was entered) and the final value was taken after exiting the loop (in SkalarFunction::eval_skalar_AB()).
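
For anyone who wants to reproduce this kind of measurement outside of GNU APL, here is a minimal sketch of the sampling scheme described above. It is not the code inside SkalarFunction::eval_skalar_AB(); the function name sample_add is made up for illustration, and it reads std::chrono::steady_clock (nanoseconds) instead of the CPU cycle counter to stay portable.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not GNU APL code: computes z = a + a element-wise and
// records the elapsed time in nanoseconds every `stride` iterations.
// The first sample is taken before the loop, the last one after it exits.
std::vector<std::int64_t> sample_add(const std::vector<double>& a,
                                     std::vector<double>& z,
                                     std::size_t stride)
{
    assert(stride > 0 && z.size() == a.size());

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    auto elapsed = [&start] {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
                   clock::now() - start).count();
    };

    std::vector<std::int64_t> samples;
    samples.push_back(elapsed());              // before entering the loop
    for (std::size_t i = 0; i < a.size(); ++i) {
        z[i] = a[i] + a[i];                    // A+A
        if ((i + 1) % stride == 0 && i + 1 < a.size())
            samples.push_back(elapsed());      // every `stride` iterations
    }
    samples.push_back(elapsed());              // after leaving the loop
    return samples;
}
```

With 1048576 elements and a stride of 4096 this yields 257 values: one before the loop, 255 inside it, and one after it.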

The way I read these results is that the inner loop for scalar functions scales linearly on a 2-core machine.

If the total time scales worse than that, then either the sequential part is too big (Amdahl's law) or something else is wrong. For example, in one of my first tests everything compiled fine but still only one core was used.
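
To put numbers on the Amdahl's law remark, here is a small sketch that computes an observed speedup and the sequential fraction it would imply. The function names are made up for illustration. It also assumes that the first list of counter values at the end of this mail is the one-core run and the second list is the two-core run; that reading is an interpretation of the data, not something stated with the lists.

```cpp
#include <cassert>

// Observed speedup: sequential cycles divided by parallel cycles.
double speedup(double sequential_cycles, double parallel_cycles)
{
    assert(parallel_cycles > 0);
    return sequential_cycles / parallel_cycles;
}

// Amdahl's law: speedup on `cores` cores when a fraction `serial`
// of the work cannot be parallelised.
double amdahl_speedup(double serial, int cores)
{
    assert(cores > 0);
    return 1.0 / (serial + (1.0 - serial) / cores);
}

// Inverse of Amdahl's law: the serial fraction implied by an
// observed speedup on `cores` cores.
double implied_serial_fraction(double observed_speedup, int cores)
{
    assert(cores > 1 && observed_speedup > 0);
    return (cores / observed_speedup - 1.0) / (cores - 1.0);
}
```

Under that assumption, speedup(81834662, 43046192) comes out at roughly 1.9, and implied_serial_fraction for 2 cores at roughly 0.05. This covers the inner loop only; the total runtime of an APL expression also includes allocation and copying, which is where a larger sequential part could hide.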

/// Jürgen


On 03/14/2014 05:18 PM, David Lamkins wrote:
This is interesting. The parallel speedup on your machine using TBB is in the same ballpark as on my machine using OpenMP, and they're both delivering less than a 2:1 speedup.

I informally ran some experiments two nights ago to try to characterize the behavior. On my machine, with OpenMP #pragmas on the scalar loops, the ratio of multi-threaded to single-threaded runtimes held stubbornly at about 0.7 regardless of the size of the problem. I tried integer and float data, addition and power, with ravels up to 100 million elements. (My smallest test set was a million elements; I still need to try smaller sets to see whether I can find a knee where the thread setup overhead dominates and results in a runtime ratio greater than 1.)

I'm not sure what this means, yet. I'd hoped to see some further improvement as the ravel size increased, despite the internal inefficiencies. TBH, I didn't find and annotate the copy loop(s); that might have a lot to do with my results. (I did find and annotate another loop in check_value(), though. Maybe parallelizing that will improve your results.) I'm hoping that the poor showing so far isn't a result of memory bandwidth limitations.

I hope to spend some more time on this over the weekend.


P.S.: I will note that the nice part about using OpenMP is that there's no hand-coding necessary. All you do is add #pragmas to your program; the compiler takes care of the rewrites.
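
As an illustration of the P.S., here is roughly what such an annotation looks like on a simple element-wise loop. This is a sketch only: add_ravel is a made-up stand-in, not one of the loops in the GNU APL source. With GCC the pragma takes effect when compiling with -fopenmp; without that flag it is ignored and the loop runs sequentially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a dyadic scalar loop; not the GNU APL source.
void add_ravel(const std::vector<double>& a,
               const std::vector<double>& b,
               std::vector<double>& z)
{
    assert(a.size() == b.size() && a.size() == z.size());
    const long n = static_cast<long>(a.size());

    // Each iteration writes a distinct z[i], so the iterations are
    // independent and can be divided among the threads.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        z[i] = a[i] + b[i];
}
```

The loop index is a signed integer because older OpenMP versions require that for parallel for loops.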


---------- Forwarded message ----------

    From: "Elias Mårtenson" <loke...@gmail.com>
    To: "bug-apl@gnu.org" <bug-apl@gnu.org>
    Cc:
    Date: Fri, 14 Mar 2014 22:22:15 +0800
    Subject: [Bug-apl] Performance optimisations: Results
    Hello guys,

    I've spent some time experimenting with various performance
    optimisations and I would like to share my latest results with you:

    I've run a lot of tests using Callgrind, which is part of the
    Valgrind <http://valgrind.org/> tool (documentation here
    <http://valgrind.org/docs/manual/cl-manual.html>). In doing so,
    I've concluded that a disproportionate amount of time is spent
    copying values (this can be parallelised; more about that below).

    I set out to see how much faster I could make a simple test program
    that applied a monadic scalar function. Here is my test program:

        ∇Z←testinv;tmp
        src←10000 4000⍴÷⍳100
        'Starting'
        tmp←{--------⍵} time src
        Z←1
        ∇


    This program calls my time operator which simply shows the amount
    of time it took to execute the operation. This is of course needed
    for benchmarking. For completeness, here is the implementation of
    time:

        ∇Z←L (OP time) R;start;end
        start←⎕AI
        →(0≠⎕NC 'L')/twoargs
        Z←OP R
        →finish
        twoargs:
        Z←L OP R
        finish:
        end←⎕AI
        'Time:',((end[3]+end[2]×1E6) - (start[3]+start[2]×1E6))÷1E6
        ∇


    The unmodified version of GNU APL runs this in
    *5037.00* milliseconds on my machine.

    I then set out to minimise the amount of cloning of values, taking
    advantage of the existing temp functionality. Once I had done
    this, the execution time was reduced to *2577.00* ms.

    I then used the Threading Building Blocks
    <https://www.threadingbuildingblocks.org/> library to parallelise
    two operations: The clone operation and the monadic
    SkalarFunction::eval_skalar_B(). After this, on my 4-core machine,
    the runtime was reduced to *1430.00* ms.

    Threading Building Blocks is available from the application
    repositories of at least Arch Linux and Ubuntu, and I'm sure it's
    available elsewhere too. To test it on OS X I had to download it
    separately.

    To summarise:

      * Standard: 5037.00 ms
      * Reduced cloning: 2577.00 ms
      * Parallel: 1430.00 ms

    I have attached the patch, but it's definitely not something that
    should be applied blindly. I have hacked around in several parts
    of the code, some of which I can't say I understand fully, so see
    it as a proof-of-concept, nothing else.

    Note that the code that implements the parallelism using TBB is
    pretty ugly, and the code ends up being duplicated in the parallel
    and non-parallel version. This can, of course, be encapsulated
    much nicer if one wants to make this generic.

    Another thing: TBB is incredibly efficient, especially on Intel
    CPUs. I'd be very interested to see how OpenMP performs on this
    same code.

    Regards,
    Elias


--
"The secret to creativity is knowing how to hide your sources."
   Albert Einstein


http://soundcloud.com/davidlamkins
http://reverbnation.com/lamkins
http://reverbnation.com/lcw
http://lamkins-guitar.com/
http://lamkins.net/
http://successful-lisp.com/

<<attachment: two-cores.png>>

   0, 168
   1, 344610
   2, 673064
   3, 994497
   4, 1316056
   5, 1638490
   6, 1959482
   7, 2281776
   8, 2603300
   9, 2928142
  10, 3248644
  11, 3569132
  12, 3889956
  13, 4211144
  14, 4531310
  15, 4851546
  16, 5175296
  17, 5495462
  18, 5814326
  19, 6134856
  20, 6454000
  21, 6773466
  22, 7093933
  23, 7433902
  24, 7756784
  25, 8076390
  26, 8395310
  27, 8714636
  28, 9033192
  29, 9352896
  30, 9672544
  31, 9994747
  32, 10313660
  33, 10633336
  34, 10952536
  35, 11272303
  36, 11591251
  37, 11911312
  38, 12232934
  39, 12551910
  40, 12871285
  41, 13191038
  42, 13510350
  43, 13829144
  44, 14148456
  45, 14470274
  46, 14789236
  47, 15108212
  48, 15426971
  49, 15745394
  50, 16062991
  51, 16382261
  52, 16756110
  53, 17081295
  54, 17400432
  55, 17719674
  56, 18038454
  57, 18357031
  58, 18673998
  59, 18991546
  60, 19313728
  61, 19634048
  62, 19953269
  63, 20272378
  64, 20590878
  65, 20909924
  66, 21229110
  67, 21550823
  68, 21869946
  69, 22189146
  70, 22508654
  71, 22827924
  72, 23146956
  73, 23466030
  74, 23788394
  75, 24106040
  76, 24424883
  77, 24744342
  78, 25062975
  79, 25381552
  80, 25700374
  81, 26037060
  82, 26357114
  83, 26675530
  84, 26993652
  85, 27310794
  86, 27629147
  87, 27948151
  88, 28266924
  89, 28588630
  90, 28907130
  91, 29226134
  92, 29545180
  93, 29864100
  94, 30182726
  95, 30501394
  96, 30823590
  97, 31141572
  98, 31459708
  99, 31778229
 100, 32096435
 101, 32415082
 102, 32735178
 103, 33053468
 104, 33374929
 105, 33694066
 106, 34013756
 107, 34332760
 108, 34651519
 109, 34970082
 110, 35299404
 111, 35621082
 112, 35939904
 113, 36259055
 114, 36578241
 115, 36896601
 116, 37215598
 117, 37534336
 118, 37856539
 119, 38174528
 120, 38493798
 121, 38811948
 122, 39129146
 123, 39448171
 124, 39766440
 125, 40087068
 126, 40405589
 127, 40723837
 128, 41042512
 129, 41361544
 130, 41679624
 131, 41997004
 132, 42316526
 133, 42637273
 134, 42955878
 135, 43273916
 136, 43592948
 137, 43911826
 138, 44230368
 139, 44594116
 140, 44916886
 141, 45233048
 142, 45549532
 143, 45866786
 144, 46182598
 145, 46498144
 146, 46815489
 147, 47137230
 148, 47453630
 149, 47769505
 150, 48086640
 151, 48401927
 152, 48717515
 153, 49034475
 154, 49350896
 155, 49669886
 156, 49985684
 157, 50302966
 158, 50618344
 159, 50933988
 160, 51250220
 161, 51567040
 162, 51885834
 163, 52201086
 164, 52517444
 165, 52832815
 166, 53148340
 167, 53464600
 168, 53781112
 169, 54118526
 170, 54434282
 171, 54751228
 172, 55067194
 173, 55382726
 174, 55698804
 175, 56014147
 176, 56330120
 177, 56649726
 178, 56967400
 179, 57283884
 180, 57601404
 181, 57918350
 182, 58234666
 183, 58550338
 184, 58869594
 185, 59184944
 186, 59500378
 187, 59816960
 188, 60132821
 189, 60449431
 190, 60765964
 191, 61083932
 192, 61399800
 193, 61716424
 194, 62032768
 195, 62349315
 196, 62664980
 197, 62981800
 198, 63309197
 199, 63628061
 200, 63944965
 201, 64260952
 202, 64577590
 203, 64894431
 204, 65209830
 205, 65525404
 206, 65844016
 207, 66160773
 208, 66476788
 209, 66792327
 210, 67109336
 211, 67424826
 212, 67739868
 213, 68058438
 214, 68375216
 215, 68691322
 216, 69008170
 217, 69324766
 218, 69640186
 219, 69956320
 220, 70272090
 221, 70591780
 222, 70908096
 223, 71223348
 224, 71539720
 225, 71855182
 226, 72170448
 227, 72529289
 228, 72854768
 229, 73172932
 230, 73491873
 231, 73812788
 232, 74132177
 233, 74450908
 234, 74770129
 235, 75093508
 236, 75410881
 237, 75728058
 238, 76045123
 239, 76365569
 240, 76683824
 241, 77002506
 242, 77324191
 243, 77642572
 244, 77961282
 245, 78279817
 246, 78596630
 247, 78915305
 248, 79233868
 249, 79552634
 250, 79875516
 251, 80194058
 252, 80512754
 253, 80831597
 254, 81151700
 255, 81470620
266, 81834662

   0, 329994
   1, 700805
   2, 1055285
   3, 1401722
   4, 1750224
   5, 2096297
   6, 2433767
   7, 2764090
   8, 3091312
   9, 3417225
  10, 3743502
  11, 4069653
  12, 4397246
  13, 4722543
  14, 5047273
  15, 5399695
  16, 5724985
  17, 6049575
  18, 6373717
  19, 6700295
  20, 7025263
  21, 7349384
  22, 7674674
  23, 8000055
  24, 8324981
  25, 8650236
  26, 8976373
  27, 9301397
  28, 9628493
  29, 9953895
  30, 10278905
  31, 10605672
  32, 10931907
  33, 11257043
  34, 11582494
  35, 11907399
  36, 12232318
  37, 12557321
  38, 12882086
  39, 13207215
  40, 13533835
  41, 13860651
  42, 14185801
  43, 14510629
  44, 14888741
  45, 15213534
  46, 15538747
  47, 15864107
  48, 16189026
  49, 16513959
  50, 16838878
  51, 17163097
  52, 17488149
  53, 17814510
  54, 18139898
  55, 18465265
  56, 18790891
  57, 19115691
  58, 19440932
  59, 19766194
  60, 20091400
  61, 20416277
  62, 20740832
  63, 21065807
  64, 21390649
  65, 21714623
  66, 22039920
  67, 22364965
  68, 22690087
  69, 23016049
  70, 23341507
  71, 23665915
  72, 24054205
  73, 24380426
  74, 24704785
  75, 25029984
  76, 25354462
  77, 25680172
  78, 26005098
  79, 26330276
  80, 26654068
  81, 26978490
  82, 27303843
  83, 27630302
  84, 27955235
  85, 28279839
  86, 28606942
  87, 29037477
  88, 29370565
  89, 29695519
  90, 30020396
  91, 30345644
  92, 30671074
  93, 30996931
  94, 31321409
  95, 31645971
  96, 31971198
  97, 32297503
  98, 32622933
  99, 32947271
 100, 33290887
 101, 33616562
 102, 33941705
 103, 34266995
 104, 34592383
 105, 34917015
 106, 35241899
 107, 35566125
 108, 35890715
 109, 36215193
 110, 36541386
 111, 36866599
 112, 37192981
 113, 37519013
 114, 37844212
 115, 38169551
 116, 38495499
 117, 38821944
 118, 39147766
 119, 39472755
 120, 39797807
 121, 40122061
 122, 40447596
 123, 40773404
 124, 41099667
 125, 41425055
 126, 41750394
 127, 42074837
 128, 329973
 129, 692237
 130, 1050119
 131, 1394449
 132, 1742629
 133, 2109072
 134, 2439416
 135, 2768787
 136, 3095575
 137, 3420998
 138, 3746267
 139, 4072194
 140, 4481806
 141, 4805766
 142, 5132162
 143, 5458481
 144, 5782434
 145, 6106835
 146, 6430802
 147, 6760726
 148, 7083335
 149, 7406868
 150, 7731563
 151, 8056468
 152, 8381562
 153, 8705018
 154, 9031946
 155, 9355192
 156, 9678354
 157, 10000473
 158, 10324314
 159, 10646748
 160, 10969630
 161, 11298203
 162, 11620567
 163, 11943323
 164, 12265253
 165, 12588065
 166, 12913068
 167, 13236972
 168, 13584578
 169, 13910085
 170, 14231560
 171, 14554190
 172, 14878570
 173, 15200416
 174, 15522122
 175, 15845025
 176, 16170476
 177, 16493617
 178, 16815806
 179, 17138037
 180, 17460394
 181, 17782072
 182, 18104555
 183, 18432176
 184, 18754428
 185, 19076148
 186, 19397770
 187, 19720260
 188, 20043128
 189, 20365877
 190, 20693547
 191, 21016156
 192, 21339353
 193, 21661857
 194, 21984158
 195, 22305402
 196, 22627871
 197, 22970878
 198, 23291898
 199, 23614227
 200, 23937928
 201, 24260355
 202, 24582208
 203, 24904404
 204, 25231283
 205, 25553115
 206, 25875157
 207, 26198284
 208, 26521894
 209, 26844797
 210, 27166972
 211, 27488230
 212, 27815116
 213, 28136311
 214, 28457975
 215, 29644202
 216, 29980587
 217, 30302195
 218, 30624433
 219, 30947336
 220, 31269945
 221, 31593408
 222, 31915772
 223, 32314625
 224, 32637157
 225, 32959605
 226, 33282704
 227, 33604459
 228, 33927138
 229, 34250272
 230, 34577991
 231, 34899067
 232, 35220472
 233, 35542605
 234, 35865158
 235, 36189307
 236, 36510761
 237, 36833664
 238, 37163714
 239, 37485427
 240, 37808246
 241, 38130841
 242, 38453506
 243, 38776654
 244, 39099004
 245, 39425561
 246, 39747190
 247, 40069449
 248, 40392170
 249, 40714128
 250, 41035918
 251, 41384420
 252, 41709689
 253, 42031696
 254, 42353647
 255, 42673624
266, 43046192
