Hi Akihiro,

I've been talking to Rafael Larrosa regarding the issue you are reporting. He has run some experiments with your code on Titan (Cray XK7 at ORNL, which features the Cray Gemini interconnect) and there is no communication problem there. However, we have a cluster here in Malaga based on InfiniBand (ibv-conduit), and the execution fails on that platform. I've also confirmed that udp-conduit does not pose any problem.
Rafael Larrosa told me that he faced the same issue last month and that, after spending more than two weeks on the problem, he is almost sure that there is a bug (or a maximum buffer size limitation) in the ibv-conduit implementation of GASNet. For his code, he found a turning point when the transferred buffer was 128 MBytes: smaller communications work fine, but larger ones always fail. He says it was tricky, because when you try to isolate the problem (i.e. isolate the particular transfer that fails by executing just that single communication) the problem vanishes. So it will be challenging to chase this bug.

You may want to circumvent the bug by:

1.- Not using the bulkComms optimization (-suseBulkTransferStride=false -suseBulkTransfer=false). --> Slower comms.
2.- Implementing a version of bulkComms that splits big messages into smaller ones. --> Wearisome tinkering.
3.- Avoiding ibv-conduit. --> 207 out of the 500 supercomputers in the latest top500 list are based on ibv.
4.- Diving into the ibv-conduit implementation. --> Probably not your main research goal.

For the time being we are conducting all our experiments on Cray machines, so we do not plan (and do not have time) to tackle 2 or 4, so we are getting by with 3. If Rafael wants to chime in, he can probably give you more details and advice, should you want to debug your code at a lower level.

Regards,

Rafa.

On 28/01/2014, at 19:31, Akihiro Hayashi <[email protected]> wrote:

> Hi, Rafael,
>
> Sorry for the delayed reply.
> Let me share the program that reproduces the problem. (attached below)
>
> As you can see, the program prints "INVALID? : true" if we get a bulk copy
> transfer error, otherwise it prints "INVALID? : false".
> I get the error when I run the program on 2 locales with ibv-conduit
> (mpi-spawner). The input data size is: matrixSize = 2000 and tileSize = 200.
> Please let me know if you want the input file.
> Note that I don't get the error when I run the program on 1 locale.
> In addition, I don't get the error with a smaller data size even on 2 or more
> locales (e.g. 10x10 matrix and 2x2 tile size).
> I'm guessing that using ibv-conduit and transferring a certain amount of data
> incurs this problem.
> FYI, using udp-conduit (amudprun) does not show the error.
>
> Please let me know if you have any comments and questions.
>
> Best,
>
> Akihiro
>
> --
>
> use BlockDist;
>
> config const matrixSize: int(32) = -1;
> config const tileSize: int(32) = -1;
> config const inFile: string = "m_2000.in";
> const zero: int(32) = 0;
> var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
>
> class Tile {
>   var tile_array: [tile_array_indices] real;
> }
>
> proc read_2D_array ( fileName: string, matrixSize: int(32) ) {
>   var input_stream = open (fileName, iomode.r);
>   var reader = input_stream.reader();
>   var matrix_index_2D = {0..matrixSize-1, 0..matrixSize-1};
>   var array: [matrix_index_2D] real;
>
>   for ij in matrix_index_2D do {
>     reader.read(array(ij));
>   }
>   input_stream.close();
>   reader.close();
>   // if (debug) { writeln("whole array: ",array); }
>   return array;
> }
>
> proc main(): void {
>   writeln("numLocales : ", numLocales);
>
>   var numTiles: int(32) = matrixSize/tileSize;
>   var numTiles_2: int(64) = matrixSize/tileSize;
>
>   var whole_array = read_2D_array(inFile, matrixSize);
>
>   var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2,
>                          zero..numTiles_2};
>   var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
>   var lkji_tiles: [ijk_space] Tile;
>
>   for i in zero..numTiles-1 do {
>     for j in zero..i do {
>       on lkji_tiles(i,j,zero).locale do {
>         var curr_tile: Tile = new Tile();
>         for (ii,jj) in tile_array_indices do {
>           curr_tile.tile_array(ii,jj) =
>             whole_array(i*tileSize+ii,j*tileSize+jj);
>         }
>         lkji_tiles(i,j,zero) = curr_tile;
>       }
>     }
>   }
>   var invalid : bool = false;
>   for i in zero..numTiles-1 do {
>     for iB in zero..tileSize-1 do {
>       for j in zero..i do {
>         var temp = lkji_tiles(i,j,zero).tile_array;
>         if(i != j) {
>           for jB in zero..tileSize-1 do {
>             if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
>               invalid = true;
>             }
>           }
>         } else {
>           for jB in zero..iB do {
>             if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
>               invalid = true;
>             }
>           }
>         }
>       }
>     }
>   }
>   writeln("INVALID? : ", invalid);
>
> }
> On Jan 22, 2014, at 1:46 PM, Akihiro Hayashi wrote:
>
>> Hi Rafael,
>>
>> Thanks for your reply.
>>
>> I inlined my comments below:
>>
>>> May we have a simplified copy of your code (kinda the snippet provided
>>> below but with initial values for tileSize, numTiles_2, k, etc. i.e.
>>> something that compiles) so that we can also give it a go?
>> Yes, it would be better if we can have a simplified code.
>> Actually, I have been trying to make a simple code that reproduces this
>> problem for several weeks. Finally I managed to make it this morning.
>> Let me ask my advisor if we can show you the code.
>>
>>> Would you like to try also with these flags?:
>>>
>>> -suseBulkTransferStride=true -suseBulkTransfer=false
>> I tried these flags, but I still get the error.
>>
>> I'll keep you updated.
>>
>> Best,
>>
>> Akihiro
>>
>> On Jan 22, 2014, at 5:23 AM, Rafael Asenjo Plaza wrote:
>>
>>> Hi Akihiro,
>>>
>>> May we have a simplified copy of your code (kinda the snippet provided
>>> below but with initial values for tileSize, numTiles_2, k, etc. i.e.
>>> something that compiles) so that we can also give it a go?
>>>
>>> Would you like to try also with these flags?:
>>>
>>> -suseBulkTransferStride=true -suseBulkTransfer=false
>>>
>>> Thank you,
>>>
>>> Rafa.
>>>
>>> On 21/01/2014, at 18:33, Akihiro Hayashi <[email protected]> wrote:
>>>
>>>> Dear Chapel developers,
>>>>
>>>> This is Akihiro Hayashi, postdoc at Rice University.
>>>> I'm writing to ask about an array copy failure in Chapel.
>>>>
>>>> I'm trying to evaluate some Chapel benchmarks across multiple nodes but I
>>>> get a strange error.
>>>> Please note that I'm using an old version of the Chapel compiler (r21945) with
>>>> qthread-1.10 and GASNet-1.20.2 (infiniband-conduit, mpi-spawner) because
>>>> the latest version does not work.
>>>> With the latest version of the Chapel compiler (r22568) with qthread-1.10 and
>>>> GASNet-1.22.0 (infiniband-conduit, mpi-spawner), I get a SEGV when running
>>>> a simple program (coforall loc in Locales do on loc { writeln(loc); })
>>>> across multiple nodes with the mpi spawner.
>>>> This is another problem, but I have not investigated it yet. I'll
>>>> work on this later.
>>>>
>>>> The following problem might be fixed in the latest version, but any
>>>> comments and suggestions are appreciated.
>>>> Here is part of my code.
>>>> The main data structure is a 3-dimensional array, declared as a
>>>> distributed array in which each element refers to a 2-dimensional array.
>>>> You can see the array copy statement (liBlock =
>>>> lkji_tiles(k,k,k+1).tile_array;) in Line 11. I want to use this copy
>>>> statement because the Chapel compiler generates bulk transfer code, which
>>>> accelerates program execution.
>>>>
>>>> // Code
>>>> 1: const zero: int(32) = 0;
>>>> 2: var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
>>>> 3: class Tile {
>>>> 4:   var tile_array: [tile_array_indices] real;
>>>> 5: }
>>>> 6: var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2,
>>>>                           zero..numTiles_2};
>>>> 7: var ijk_space = proto_ijk_space dmapped
>>>>                     Block(boundingBox=proto_ijk_space);
>>>> 8: var lkji_tiles: [ijk_space] Tile;
>>>> ...
>>>> 9: begin {
>>>> ...
>>>> 10:   var liBlock: [tile_array_indices] real;
>>>> 11:   liBlock = lkji_tiles(k,k,k+1).tile_array;
>>>> 12:   for (m,n) in tile_array_indices {
>>>> 13:     if (liBlock(m,n) != lkji_tiles(k,k,k+1).tile_array(m,n)) {
>>>> 14:       invalid = true;
>>>> 15:     }
>>>> 16:   }
>>>> 17:   if (invalid) { writeln("Copy Failed"); }
>>>> 18:   ...
>>>> 19: }
>>>> ...
>>>>
>>>> In my experiment, when running the program on 2 or more locales, the
>>>> program prints "Copy Failed", which means "liBlock =
>>>> lkji_tiles(k,k,k+1).tile_array;" in Line 11 failed.
>>>> This happens sometimes (not always), and I confirmed the copy is
>>>> successfully done if I replace the array copy in Line 11 with a copy loop.
>>>> Additionally, I also see the same behavior when I replace the array copy
>>>> in Line 11 with
>>>> liBlock._value.doiBulkTransfer(lkji_tiles(k,k,k+1).tile_array);.
>>>>
>>>> Here is an output log at runtime when I compile the program with -s
>>>> debugBulkTransfer (tileSize=200):
>>>>
>>>> -- Log starts here
>>>> In DefaultRectangularArr.doiBulkTransfer(): Alo=(0, 0), Blo=(0, 0), len=40000, elemSize=8;
>>>> -- End of Log
>>>>
>>>> In both cases, the runtime internally calls the chpl_comm_get API(*) and the
>>>> API takes the above parameters.
>>>> I think it looks good.
>>>> (*) Please take a look at the doiBulkTransfer function in
>>>> CHPL_HOME/modules/internal/DefaultRectangular.chpl
>>>>
>>>> Any comments and suggestions are appreciated.
>>>>
>>>> Best regards,
>>>>
>>>> Akihiro
>>>> _______________________________________________
>>>> Chapel-developers mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/chapel-developers
>>>
>>> __
>>> Rafael Asenjo Plaza
>>> Dept. Arquitectura de Computadores
>>> Complejo Tecnologico Campus de Teatinos
>>> E-29071 MALAGA (SPAIN)
>>> Tel: +34 95 213 27 91
>>> Fax: +34 95 213 27 90
>>> http://www.ac.uma.es/~asenjo
>>>
>>>
>>
>
__
Rafael Asenjo Plaza
Dept. Arquitectura de Computadores
Complejo Tecnologico Campus de Teatinos
E-29071 MALAGA (SPAIN)
Tel: +34 95 213 27 91
Fax: +34 95 213 27 90
http://www.ac.uma.es/~asenjo
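
A note on workaround 2 (splitting big messages into smaller ones): the untested sketch below shows one way the idea could look at the Chapel source level, copying a 2D array in blocks of rows instead of in a single whole-array assignment. The names copyInRowChunks and rowsPerChunk are hypothetical and are not part of Chapel or of the bulkComms code. The sketch assumes both arrays are non-strided, indexed by default int, and share the same index set. It also assumes that each slice assignment turns into its own, smaller transfer, which would need to be checked on the target system (for instance with the debugBulkTransfer setting used earlier in this thread).

```chapel
// Untested sketch: copy src into dst a few rows at a time.
// copyInRowChunks and rowsPerChunk are hypothetical names, not Chapel APIs.
// Assumes 2D, non-strided arrays indexed by default int with equal domains.
proc copyInRowChunks(ref dst: [] real, src: [] real, rowsPerChunk: int) {
  // Split the 2D domain into its row range and column range.
  const (rowRange, colRange) = dst.domain.dims();

  // Walk the rows in steps of rowsPerChunk.
  for lo in rowRange by rowsPerChunk {
    // Last chunk may be shorter than rowsPerChunk.
    const hi = min(lo + rowsPerChunk - 1, rowRange.high);
    // Each slice assignment copies at most rowsPerChunk full rows.
    dst[lo..hi, colRange] = src[lo..hi, colRange];
  }
}
```

With tileSize = 200 and rowsPerChunk = 50, each assignment would cover 50 * 200 = 10000 elements rather than the 40000 reported in the log above. For domains declared over int(32), as in the reproducer, the chunk bounds would need matching casts.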
