On Wednesday, 29 January 2014 16:09:18 Akihiro Hayashi wrote:
> Hi, Brad,
> 
> Thank you for your suggestions.
> 
> > I'm always glad when we can pass the blame to something other than Chapel.
> >  :)
> Exactly :)
> 
> I tried the --enable-debug flag at GASNet configuration time and rebuilt my
> Chapel compiler and runtime. Then I compiled and ran the program with
> GASNET_BACKTRACE=1, but I don't get any message. Are debug messages
> supposed to be shown during program execution if we configure GASNet with
> the --enable-debug flag?
> 
> Please let me know if you have any suggestions.

First, you can enable more checks on the Chapel side by compiling with:

 --bounds-checks --local-checks --nil-checks --debug

It can also help to wrap the Chapel communication code with:

startVerboseComm();
...
stopVerboseComm();

On the GASNet side, you can enable tracing, so that all operations are
logged, and you can also collect statistics. To activate both, you must add
some configure options for GASNet in the file:

third-party/gasnet/Makefile

Add these options to the GASNet configuration:

CHPL_GASNET_CFG_OPTIONS += --enable-segment-$(CHPL_MAKE_COMM_SEGMENT) --enable-allow-gcc4 --enable-stats --enable-trace

Depending on the MXM version on your system, you may also need to add:
--disable-mxm

After that, to see the trace and the stats you must define these variables:

export GASNET_TRACEFILE=gas_tracefile.out
export GASNET_STATSFILE=gas_statsfile.out

Each one holds the name of the file where the corresponding output will be written.

Then you can run your program once with parameters that give a correct result
and once with parameters that give a wrong result, and compare the GASNet traces.
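A rough sketch of such a comparison follows, in Python. It assumes nothing about the GASNet trace format: it only finds the first line where two trace files differ. Raw traces may contain timestamps or other run-specific fields, so an optional normalize function lets you strip them before comparing. The names first_divergence and load_lines are hypothetical helpers, not part of GASNet or Chapel.

```python
from itertools import zip_longest


def first_divergence(good_lines, bad_lines, normalize=lambda line: line):
    """Return (index, good_line, bad_line) at the first mismatch, else None.

    A missing line (one trace is shorter than the other) is reported as None.
    """
    for index, (good, bad) in enumerate(zip_longest(good_lines, bad_lines)):
        if good is None or bad is None or normalize(good) != normalize(bad):
            return index, good, bad
    return None


def load_lines(path):
    """Read a trace file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle]
```

To use it, set GASNET_TRACEFILE to a different file name for each run (the names are up to you), then call first_divergence(load_lines(good_path), load_lines(bad_path)).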


I have tried the GASNet tests. I did that at Christmas, but I cannot
remember how I submitted it to the queue system. Basically it is the testvis
test, which can be compiled from the ibv_conduit directory using make test-seq.

As I said, I did that and it worked fine, although I didn't change the program
to test other array sizes.


Hope this helps,

Rafael


> Best,
> 
> Akihiro
> 
> On Jan 29, 2014, at 2:40 PM, Brad Chamberlain wrote:
> > Hi guys --
> > 
> > While I'm sorry about the time spent on this issue, I'm always glad when
> > we can pass the blame to something other than Chapel.  :)
> > 
> > Something I'm wondering is whether, if GASNet is built with runtime checks
> > on (I think this is done using --enable-debug at GASNet configuration
> > time), will the ibv-conduit issue show up as a runtime assertion or
> > something other than a bug?  I think we may want to pass along a report
> > of this issue to the GASNet team, in which case this is undoubtedly the
> > first question they'll ask.
> > 
> > It would also be great if we had a standalone C+GASNet program that
> > exhibited the issue (as they're not deeply invested in Chapel), but if
> > that isn't a 15-minute exercise, we can tell them how to reproduce in
> > Chapel.
> > 
> > Thanks,
> > -Brad
> > 
> > On Wed, 29 Jan 2014, Akihiro Hayashi wrote:
> >> Hi, Rafael,
> >> 
> >> Thanks for your reply. I inlined my comments below:
> >>> I’ve been talking to Rafael Larrosa regarding the issue you are
> >>> reporting. He has conducted some experiments with your code on Titan
> >>> (Cray XK7 at ORNL, which features Cray Gemini interconnect) and there
> >>> are no communication problems there. However, we have here in Malaga a
> >>> cluster based on Infiniband (ibv-conduit) and the execution fails on
> >>> that platform. I’ve also confirmed that udp-conduit does not pose any
> >>> problem.
> >> 
> >> I really appreciate your and Rafael Larrosa's experiments. I'm glad to
> >> hear that this is not a problem in the Chapel compiler.
> >> 
> >>> Rafael Larrosa told me that he faced the same issue last month and that
> >>> after spending more than two weeks tackling the problem he is almost
> >>> sure that there is a bug (or a maximum buffer size limitation) in the
> >>> ibv-conduit implementation of gasnet. For his code, he found a turning
> >>> point when the transferred buffer was 128 MBytes: smaller communications
> >>> work fine, but larger ones always fail. He says it was tricky, because
> >>> when you try to isolate the problem (i.e. isolate the particular transfer
> >>> that fails by executing just this single communication) the problem
> >>> vanishes. So it will be challenging to chase this bug.
> >> 
> >> Sure, now I understand there is a bug in the ibv-conduit implementation
> >> of GASNet, and yes, it seems fixing the bug is very difficult.
> >> Actually, the original benchmark I want to run spawns many tasks with
> >> begin statements, and each task does a bulk transfer. I can imagine the
> >> benchmark exceeds some limit, like Rafael's code. I'm also wondering why
> >> the simplified code has this problem. There might be another problem.
> >> 
> >>> You may want to circumvent the bug by:
> >>> 
> >>> 1.- Not using bulkComms optimization (-suseBulkTransferStride=false
> >>>     -suseBulkTransfer=false). -> Slower comms.
> >>> 2.- Implementing a version of bulkComms that splits big messages into
> >>>     smaller ones. -> wearisome tinkering
> >>> 3.- Avoid ibv-conduit -> 207 out of the 500 supercomputers in latest
> >>>     top500 list are based on ibv
> >>> 4.- Dive into the ibv-conduit implementation -> Probably not your main
> >>>     research goal
> >>> 
> >>> For the time being we are conducting all our experiments on Cray
> >>> machines, so we do not plan (and do not have time) to tackle 2 or 4, so
> >>> we are getting by with 3.
> >> 
> >> Exactly, 4 is not my research goal. I'd choose 3 if a benchmark I would
> >> like to run uses bulk transfer. Thanks for your suggestions.
> >> 
> >>> If Rafael wants to chime in, he can probably give you more details and
> >>> advice, should you want to debug your code at a lower level.
> >> 
> >> I would appreciate it if he could give me more details. I think I should
> >> mention the bug in my paper or something.
> >> 
> >> Best,
> >> 
> >> Akihiro
> >> 
> >> On Jan 29, 2014, at 4:46 AM, Rafael Asenjo Plaza wrote:
> >>> Hi Akihiro,
> >>> 
> >>> I’ve been talking to Rafael Larrosa regarding the issue you are
> >>> reporting. He has conducted some experiments with your code on Titan
> >>> (Cray XK7 at ORNL, which features Cray Gemini interconnect) and there
> >>> are no communication problems there. However, we have here in Malaga a
> >>> cluster based on Infiniband (ibv-conduit) and the execution fails on
> >>> that platform. I’ve also confirmed that udp-conduit does not pose any
> >>> problem.
> >>> 
> >>> Rafael Larrosa told me that he faced the same issue last month and that
> >>> after spending more than two weeks tackling the problem he is almost
> >>> sure that there is a bug (or a maximum buffer size limitation) in the
> >>> ibv-conduit implementation of gasnet. For his code, he found a turning
> >>> point when the transferred buffer was 128 MBytes: smaller communications
> >>> work fine, but larger ones always fail. He says it was tricky, because
> >>> when you try to isolate the problem (i.e. isolate the particular transfer
> >>> that fails by executing just this single communication) the problem
> >>> vanishes. So it will be challenging to chase this bug.
> >>> 
> >>> You may want to circumvent the bug by:
> >>> 
> >>> 1.- Not using bulkComms optimization (-suseBulkTransferStride=false
> >>>     -suseBulkTransfer=false). -> Slower comms.
> >>> 2.- Implementing a version of bulkComms that splits big messages into
> >>>     smaller ones. -> wearisome tinkering
> >>> 3.- Avoid ibv-conduit -> 207 out of the 500 supercomputers in latest
> >>>     top500 list are based on ibv
> >>> 4.- Dive into the ibv-conduit implementation -> Probably not your main
> >>>     research goal
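The idea behind option 2 could be sketched as below. This is a language-neutral illustration in Python, not Chapel code and not an existing bulkComms implementation. The names chunk_ranges and chunked_copy are hypothetical, and each slice assignment stands in for one bulk transfer of at most max_chunk elements.

```python
def chunk_ranges(total_len, max_chunk):
    """Yield half-open (start, stop) ranges covering 0..total_len,
    each spanning at most max_chunk elements."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    for start in range(0, total_len, max_chunk):
        yield start, min(start + max_chunk, total_len)


def chunked_copy(dst, src, max_chunk):
    """Copy src into dst piece by piece and return the number of pieces.

    Each slice assignment plays the role of one smaller transfer.
    """
    if len(dst) != len(src):
        raise ValueError("source and destination lengths differ")
    transfers = 0
    for start, stop in chunk_ranges(len(src), max_chunk):
        dst[start:stop] = src[start:stop]
        transfers += 1
    return transfers
```

In Chapel the same loop would presumably operate on array slices, with max_chunk chosen below whatever size triggers the failure.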
> >>> 
> >>> For the time being we are conducting all our experiments on Cray
> >>> machines, so we do not plan (and do not have time) to tackle 2 or 4, so
> >>> we are getting by with 3.
> >>> 
> >>> If Rafael wants to chime in, he can probably give you more details and
> >>> advice, should you want to debug your code at a lower level.
> >>> 
> >>> Regards,
> >>> 
> >>> Rafa.
> >>> 
> >>> On 28/01/2014, at 19:31, Akihiro Hayashi <[email protected]> wrote:
> >>>> Hi, Rafael,
> >>>> 
> >>>> Sorry for the delayed reply.
> >>>> Let me share the program that reproduces the problem. (attached below)
> >>>> 
> >>>> As you can see, the program prints "INVALID? : true" if we get a bulk
> >>>> copy transfer error; otherwise it prints "INVALID? : false". I get the
> >>>> error when I run the program on 2 locales with ibv-conduit
> >>>> (mpi-spawner). The input data size is matrixSize = 2000 and tileSize
> >>>> = 200. Please let me know if you want the input file. Note that I
> >>>> don't get the error when I run the program on 1 locale. In addition, I
> >>>> don't get the error with a smaller data size, even on 2 or more locales
> >>>> (e.g. a 10x10 matrix and 2x2 tile size). I'm guessing that using
> >>>> ibv-conduit and transferring a certain amount of data triggers this
> >>>> problem. FYI, using udp-conduit (amudprun) does not show the error.
> >>>> 
> >>>> Please let me know if you have any comments and questions.
> >>>> 
> >>>> Best,
> >>>> 
> >>>> Akihiro
> >>>> 
> >>>> --
> >>>> 
> >>>> use BlockDist;
> >>>> 
> >>>> config const matrixSize: int(32) = -1;
> >>>> config const   tileSize: int(32) = -1;
> >>>> config const     inFile: string = "m_2000.in";
> >>>> const zero: int(32) = 0;
> >>>> var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
> >>>> 
> >>>> class Tile {
> >>>> var tile_array: [tile_array_indices] real;
> >>>> }
> >>>> 
> >>>> proc read_2D_array ( fileName: string, matrixSize: int(32) ) {
> >>>> var input_stream = open (fileName, iomode.r);
> >>>> var reader = input_stream.reader();
> >>>> var matrix_index_2D = {0..matrixSize-1, 0..matrixSize-1};
> >>>> var array: [matrix_index_2D] real;
> >>>> 
> >>>> for ij in matrix_index_2D do {
> >>>> 
> >>>>     reader.read(array(ij));
> >>>> 
> >>>> }
> >>>> input_stream.close();
> >>>> reader.close();
> >>>> // if (debug) { writeln("whole array: ",array); }
> >>>> return array;
> >>>> }
> >>>> 
> >>>> proc main(): void {
> >>>> writeln("numLocales : ", numLocales);
> >>>> 
> >>>> var numTiles: int(32) = matrixSize/tileSize;
> >>>> var numTiles_2: int(64) = matrixSize/tileSize;
> >>>> 
> >>>> var whole_array = read_2D_array(inFile, matrixSize);
> >>>> 
> >>>> var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2,
> >>>>                        zero..numTiles_2};
> >>>> var ijk_space = proto_ijk_space dmapped
> >>>>                 Block(boundingBox=proto_ijk_space);
> >>>> var lkji_tiles: [ijk_space] Tile;
> >>>> 
> >>>> for i in zero..numTiles-1 do {
> >>>> 
> >>>>     for j in zero..i do {
> >>>>     
> >>>>         on lkji_tiles(i,j,zero).locale do {
> >>>>         
> >>>>             var curr_tile: Tile = new Tile();
> >>>>          
> >>>>          for (ii,jj) in tile_array_indices do {
> >>>>          
> >>>>                 curr_tile.tile_array(ii,jj) =
> >>>>                 whole_array(i*tileSize+ii,j*tileSize+jj);
> >>>>          
> >>>>          }
> >>>>          
> >>>>             lkji_tiles(i,j,zero) = curr_tile;
> >>>>         
> >>>>         }
> >>>>  
> >>>>  }
> >>>> 
> >>>> }
> >>>> var invalid : bool = false;
> >>>> for i in zero..numTiles-1 do {
> >>>> 
> >>>>  for iB in zero..tileSize-1 do {
> >>>>  
> >>>>         for j in zero..i do {
> >>>>          
> >>>>          var temp = lkji_tiles(i,j,zero).tile_array;
> >>>>          if(i != j) {
> >>>>          
> >>>>                 for jB in zero..tileSize-1 do {
> >>>>                  
> >>>>                  if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
> >>>>                  
> >>>>                         invalid = true;
> >>>>                  
> >>>>                  }
> >>>>                  
> >>>>                 }
> >>>>          
> >>>>          } else {
> >>>>          
> >>>>                 for jB in zero..iB do {
> >>>>                  
> >>>>                  if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
> >>>>                  
> >>>>                         invalid = true;
> >>>>                  
> >>>>                  }
> >>>>                  
> >>>>                 }
> >>>>          
> >>>>          }
> >>>>          
> >>>>         }
> >>>>  
> >>>>  }
> >>>> 
> >>>> }
> >>>> writeln("INVALID? : ", invalid);
> >>>> 
> >>>> }
> >>>> 
> >>>> On Jan 22, 2014, at 1:46 PM, Akihiro Hayashi wrote:
> >>>>> Hi Rafael,
> >>>>> 
> >>>>> Thanks for your reply.
> >>>>> 
> >>>>> I inlined my comments below:
> >>>>>> May we have a simplified copy of your code (kinda the snippet
> >>>>>> provided below but with initial values for tileSize, numTiles_2, k,
> >>>>>> etc. i.e. something that compiles) so that we can also give it a go?
> >>>>> 
> >>>>> Yes, it would be better to have simplified code.
> >>>>> Actually, I have been trying for several weeks to write a simple
> >>>>> program that reproduces this problem, and I finally managed it this
> >>>>> morning. Let me ask my advisor if we can show you the code.
> >>>>> 
> >>>>>> Would you like to try also with these flags?:
> >>>>>> 
> >>>>>> -suseBulkTransferStride=true -suseBulkTransfer=false
> >>>>> 
> >>>>> I tried these flags, but I still get the error.
> >>>>> 
> >>>>> I'll keep you updated.
> >>>>> 
> >>>>> Best,
> >>>>> 
> >>>>> Akihiro
> >>>>> 
> >>>>> On Jan 22, 2014, at 5:23 AM, Rafael Asenjo Plaza wrote:
> >>>>>> Hi Akihiro,
> >>>>>> 
> >>>>>> May we have a simplified copy of your code (kinda the snippet
> >>>>>> provided below but with initial values for tileSize, numTiles_2, k,
> >>>>>> etc. i.e. something that compiles) so that we can also give it a go?
> >>>>>> 
> >>>>>> Would you like to try also with these flags?:
> >>>>>> 
> >>>>>> -suseBulkTransferStride=true -suseBulkTransfer=false
> >>>>>> 
> >>>>>> Thank you,
> >>>>>> 
> >>>>>> Rafa.
> >>>>>> 
> >>>>>> On 21/01/2014, at 18:33, Akihiro Hayashi <[email protected]> wrote:
> >>>>>>> Dear Chapel developers,
> >>>>>>> 
> >>>>>>> This is Akihiro Hayashi, postdoc at Rice University.
> >>>>>>> I'm writing to ask about an array copy failure in Chapel.
> >>>>>>> 
> >>>>>>> I'm trying to evaluate a Chapel benchmark across multiple nodes
> >>>>>>> but I get a strange error. Please note that I'm using an old version
> >>>>>>> of the Chapel compiler (r21945) with qthread-1.10 and
> >>>>>>> GASNet-1.20.2 (infiniband-conduit, mpi-spawner) because the latest
> >>>>>>> version does not work. With the latest version of the Chapel compiler
> >>>>>>> (r22568) with qthread-1.10 and GASNet-1.22.0 (infiniband-conduit,
> >>>>>>> mpi-spawner), I get a SEGV when running a simple program (coforall
> >>>>>>> loc in Locales do on loc { writeln(loc); }) across multiple nodes
> >>>>>>> with mpi-spawner. This is a separate problem that I have not
> >>>>>>> investigated yet. I'll work on it later.
> >>>>>>> 
> >>>>>>> The following problem might be fixed in the latest version, but any
> >>>>>>> comments and suggestions are appreciated. Here is part of my code.
> >>>>>>> The main data structure is a 3-dimensional array, declared as a
> >>>>>>> distributed array, each element of which refers to a
> >>>>>>> 2-dimensional array. You can see the array copy statement (liBlock =
> >>>>>>> lkji_tiles(k,k,k+1).tile_array;) in Line 11. I want to use this
> >>>>>>> copy statement because the Chapel compiler generates bulk transfer
> >>>>>>> code for it, which accelerates program execution.
> >>>>>>> 
> >>>>>>> // Code
> >>>>>>> 1: const zero: int(32) = 0;
> >>>>>>> 2: var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
> >>>>>>> 3: class Tile {
> >>>>>>> 4:    var tile_array: [tile_array_indices] real;
> >>>>>>> 5: }
> >>>>>>> 6: var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2,
> >>>>>>>                           zero..numTiles_2};
> >>>>>>> 7: var ijk_space = proto_ijk_space dmapped
> >>>>>>>                    Block(boundingBox=proto_ijk_space);
> >>>>>>> 8: var lkji_tiles: [ijk_space] Tile;
> >>>>>>> ...
> >>>>>>> 9: begin {
> >>>>>>> ...
> >>>>>>> 10:   var liBlock: [tile_array_indices] real;
> >>>>>>> 11:   liBlock = lkji_tiles(k,k,k+1).tile_array;
> >>>>>>> 12:   for (m,n) in tile_array_indices {
> >>>>>>> 13:     if (liBlock(m,n) != lkji_tiles(k,k,k+1).tile_array(m,n)) {
> >>>>>>> 14:        invalid = true;
> >>>>>>> 15:     }
> >>>>>>> 16:   }
> >>>>>>> 17:   if (invalid) { writeln("Copy Failed");}
> >>>>>>> 18:   ...
> >>>>>>> 19: }
> >>>>>>> ...
> >>>>>>> 
> >>>>>>> In my experiment, when running the program on 2 or more locales, the
> >>>>>>> program prints "Copy Failed", which means "liBlock =
> >>>>>>> lkji_tiles(k,k,k+1).tile_array;" in Line 11 failed. This happens
> >>>>>>> sometimes (not always), and I confirmed that the copy succeeds
> >>>>>>> if I replace the array copy in Line 11 with a copy loop.
> >>>>>>> Additionally, I see the same behavior when I replace the array
> >>>>>>> copy in Line 11 with
> >>>>>>> liBlock._value.doiBulkTransfer(lkji_tiles(k,k,k+1).tile_array);.
> >>>>>>> 
> >>>>>>> Here is an output log at runtime when I compile the program with -s
> >>>>>>> debugBulkTransfer (tileSize=200):
> >>>>>>> 
> >>>>>>> -- Log starts here
> >>>>>>> In DefaultRectangularArr.doiBulkTransfer(): Alo=(0, 0), Blo=(0, 0),
> >>>>>>> len=40000, elemSize=8;
> >>>>>>> -- End of Log
> >>>>>>> 
> >>>>>>> In both cases, the runtime internally calls the chpl_comm_get API(*),
> >>>>>>> which takes the above parameters. I think they look fine.
> >>>>>>> (*) Please take a look at the doiBulkTransfer function in
> >>>>>>> CHPL_HOME/modules/internal/DefaultRectangular.chpl
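For scale, the logged parameters (len=40000, elemSize=8) give the payload of one tile copy as len times elemSize bytes. The quick check below compares that with the 128 MBytes turning point reported elsewhere in this thread. Reading "128 MBytes" as 128 MiB is an assumption, and payload_bytes is a hypothetical helper name.

```python
def payload_bytes(num_elems, elem_size):
    """Size in bytes of one contiguous transfer of num_elems elements."""
    return num_elems * elem_size


# Values taken from the debugBulkTransfer log: len=40000, elemSize=8.
tile_bytes = payload_bytes(40000, 8)

# Assumption: "128 MBytes" is read as 128 MiB (128 * 1024 * 1024 bytes).
turning_point_bytes = 128 * 1024 * 1024

print(tile_bytes)                         # 320000
print(turning_point_bytes // tile_bytes)  # 419
```

One tile copy is therefore roughly 400 times smaller than that turning point, which suggests that the size of a single transfer alone may not explain the failure seen here.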
> >>>>>>> 
> >>>>>>> Any comments and suggestions are appreciated.
> >>>>>>> 
> >>>>>>> Best regards,
> >>>>>>> 
> >>>>>>> Akihiro
> >>>>>>> _______________________________________________
> >>>>>>> Chapel-developers mailing list
> >>>>>>> [email protected]
> >>>>>>> https://lists.sourceforge.net/lists/listinfo/chapel-developers
> >>>>>> 
> >>>>>> __
> >>>>>> Rafael Asenjo Plaza
> >>>>>> Dept. Arquitectura de Computadores
> >>>>>> Complejo Tecnologico Campus de Teatinos
> >>>>>> E-29071 MALAGA (SPAIN)
> >>>>>> Tel: +34 95 213 27 91
> >>>>>> Fax: +34 95 213 27 90
> >>>>>> http://www.ac.uma.es/~asenjo
> >>> 
> >>> __
> >>> Rafael Asenjo Plaza
> >>> Dept. Arquitectura de Computadores
> >>> Complejo Tecnologico Campus de Teatinos
> >>> E-29071 MALAGA (SPAIN)
> >>> Tel: +34 95 213 27 91
> >>> Fax: +34 95 213 27 90
> >>> http://www.ac.uma.es/~asenjo
> >> 
> 
-- 
Rafael Larrosa Jiménez
Centro de Supercomputación y Bioinformática - http://www.scbi.uma.es
Universidad de Málaga

EMAIL: [email protected]          Edificio de Bioinnovación
TELEF: + 34951952788            C/ Severo Ochoa 34
FAX  : +34951952792                     Parque Tecnológico de Andalucía
                                                29590 Málaga (SPAIN)


_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
