Hi Brad,

Thank you for your suggestions.

> I'm always glad when we can pass the blame to something other than Chapel. :)

Exactly :)

I configured GASNet with the --enable-debug flag and rebuilt my Chapel compiler and runtime.
I then compiled and ran the program with GASNET_BACKTRACE=1, but I don't get any messages.
Are debug messages supposed to be shown during program execution when GASNet is configured with --enable-debug?

Please let me know if you have any suggestions.

Best,

Akihiro

On Jan 29, 2014, at 2:40 PM, Brad Chamberlain wrote:

>
> Hi guys --
>
> While I'm sorry about the time spent on this issue, I'm always glad when we
> can pass the blame to something other than Chapel. :)
>
> Something I'm wondering is whether, if GASNet is built with runtime checks on
> (I think this is done using --enable-debug at GASNet configuration time),
> will the ibv-conduit issue show up as a runtime assertion or something other
> than a bug? I think we may want to pass along a report of this issue to the
> GASNet team, in which case this is undoubtedly the first question they'll ask.
>
> It would also be great if we had a standalone C+GASNet program that exhibited
> the issue (as they're not deeply invested in Chapel), but if that isn't a
> 15-minute exercise, we can tell them how to reproduce in Chapel.
>
> Thanks,
> -Brad
>
>
> On Wed, 29 Jan 2014, Akihiro Hayashi wrote:
>
>> Hi, Rafael,
>>
>> Thanks for your reply. I inlined my comments below:
>>
>>> I’ve been talking to Rafael Larrosa regarding the issue you are reporting.
>>> He has conducted some experiments with your code on Titan (Cray XK7 at
>>> ORNL, which features Cray Gemini interconnect) and there is no
>>> communication problem there. However, we have here in Malaga a cluster
>>> based on Infiniband (ibv-conduit) and the execution fails on that platform.
>>> I’ve also confirmed that udp-conduit does not pose any problem.
>> I really appreciate your and Rafael Larrosa's experiments.
>> I'm glad to hear that this is not a problem in the Chapel compiler.
>>
>>> Rafael Larrosa told me that he faced the same issue last month and that
>>> after spending more than two weeks tackling the problem he is almost sure
>>> that there is a bug (or a maximum buffer size limitation) in the
>>> ibv-conduit implementation of gasnet. For his code, he found a turning
>>> point when the transferred buffer was 128MBytes: smaller communications
>>> work fine, but larger always fail. He says it was tricky, because when you
>>> try to isolate the problem (i.e. isolate the particular transfer that fails
>>> by executing just this single communication) then the problem vanishes. So it
>>> will be challenging to chase this bug.
>>
>> Sure, now I understand there is a bug in the ibv-conduit implementation of
>> GASNet, and yes, it seems fixing the bug is very difficult. Actually, the
>> original benchmark I want to run spawns many tasks with begin statements, and
>> each task does a bulk transfer. I can imagine the benchmark exceeds some limit,
>> like Rafael's code. I'm also wondering why the simplified code has this
>> problem. There might be another problem.
>>
>>> You may want to circumvent the bug by:
>>>
>>> 1.- Not using bulkComms optimization (-suseBulkTransferStride=false
>>> -suseBulkTransfer=false). --> Slower comms.
>>> 2.- Implementing a version of bulkComms that splits big messages into
>>> smaller ones. --> wearisome tinkering
>>> 3.- Avoid ibv-conduit --> 207 out of the 500 supercomputers in latest top500
>>> list are based on ibv
>>> 4.- Dive into the ibv-conduit implementation --> Probably not your main
>>> research goal
>>>
>>> For the time being we are conducting all our experiments on Cray machines,
>>> so we do not plan (and do not have time) to tackle 2 or 4, so we are
>>> getting by with 3.
>> Exactly, 4 is not my research goal. I'd choose 3 if a benchmark I would
>> like to run uses bulk transfer. Thanks for your suggestions.
>>
>>> If Rafael wants to chime in, he can probably give you more details and
>>> advice, should you want to debug your code at a lower level.
>>
>> I would appreciate it if he could give me more details. I think I should
>> mention the bug in my paper or something.
>>
>> Best,
>>
>> Akihiro
>>
>> On Jan 29, 2014, at 4:46 AM, Rafael Asenjo Plaza wrote:
>>
>>> Hi Akihiro,
>>>
>>> I’ve been talking to Rafael Larrosa regarding the issue you are reporting.
>>> He has conducted some experiments with your code on Titan (Cray XK7 at
>>> ORNL, which features Cray Gemini interconnect) and there is no
>>> communication problem there. However, we have here in Malaga a cluster
>>> based on Infiniband (ibv-conduit) and the execution fails on that platform.
>>> I’ve also confirmed that udp-conduit does not pose any problem.
>>>
>>> Rafael Larrosa told me that he faced the same issue last month and that
>>> after spending more than two weeks tackling the problem he is almost sure
>>> that there is a bug (or a maximum buffer size limitation) in the
>>> ibv-conduit implementation of gasnet. For his code, he found a turning
>>> point when the transferred buffer was 128MBytes: smaller communications
>>> work fine, but larger always fail. He says it was tricky, because when you
>>> try to isolate the problem (i.e. isolate the particular transfer that fails
>>> by executing just this single communication) then the problem vanishes. So it
>>> will be challenging to chase this bug.
>>>
>>> You may want to circumvent the bug by:
>>>
>>> 1.- Not using bulkComms optimization (-suseBulkTransferStride=false
>>> -suseBulkTransfer=false). --> Slower comms.
>>> 2.- Implementing a version of bulkComms that splits big messages into
>>> smaller ones. --> wearisome tinkering
>>> 3.- Avoid ibv-conduit --> 207 out of the 500 supercomputers in latest top500
>>> list are based on ibv
>>> 4.- Dive into the ibv-conduit implementation --> Probably not your main
>>> research goal
>>>
>>> For the time being we are conducting all our experiments on Cray machines,
>>> so we do not plan (and do not have time) to tackle 2 or 4, so we are
>>> getting by with 3.
>>>
>>> If Rafael wants to chime in, he can probably give you more details and
>>> advice, should you want to debug your code at a lower level.
>>>
>>> Regards,
>>>
>>> Rafa.
>>>
>>> On 28/01/2014, at 19:31, Akihiro Hayashi <[email protected]> wrote:
>>>
>>>> Hi, Rafael,
>>>>
>>>> Sorry for the delayed reply.
>>>> Let me share the program that reproduces the problem. (attached below)
>>>>
>>>> As you can see, the program prints "INVALID? : true" if we get a bulk copy
>>>> transfer error; otherwise it prints "INVALID? : false".
>>>> I get the error when I run the program on 2 locales with ibv-conduit
>>>> (mpi-spawner). The input data size is: matrixSize = 2000 and tileSize =
>>>> 200. Please let me know if you want the input file.
>>>> Note that I don't get the error when I run the program on 1 locale. In
>>>> addition, I don't get the error with a smaller data size even on 2 or more
>>>> locales (e.g. 10x10 matrix and 2x2 tile size).
>>>> I'm guessing that using ibv-conduit and transferring a certain amount of data
>>>> triggers this problem.
>>>> FYI, using udp-conduit (amudprun) does not show the error.
>>>>
>>>> Please let me know if you have any comments and questions.
>>>>
>>>> Best,
>>>>
>>>> Akihiro
>>>>
>>>> --
>>>>
>>>> use BlockDist;
>>>>
>>>> config const matrixSize: int(32) = -1;
>>>> config const tileSize: int(32) = -1;
>>>> config const inFile: string = "m_2000.in";
>>>> const zero: int(32) = 0;
>>>> var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
>>>>
>>>> class Tile {
>>>>   var tile_array: [tile_array_indices] real;
>>>> }
>>>>
>>>> proc read_2D_array ( fileName: string, matrixSize: int(32) ) {
>>>>   var input_stream = open (fileName, iomode.r);
>>>>   var reader = input_stream.reader();
>>>>   var matrix_index_2D = {0..matrixSize-1, 0..matrixSize-1};
>>>>   var array: [matrix_index_2D] real;
>>>>
>>>>   for ij in matrix_index_2D do {
>>>>     reader.read(array(ij));
>>>>   }
>>>>   // close the reader before the file it reads from
>>>>   reader.close();
>>>>   input_stream.close();
>>>>   // if (debug) { writeln("whole array: ",array); }
>>>>   return array;
>>>> }
>>>>
>>>> proc main(): void {
>>>>   writeln("numLocales : ", numLocales);
>>>>
>>>>   var numTiles: int(32) = matrixSize/tileSize;
>>>>   var numTiles_2: int(64) = matrixSize/tileSize;
>>>>
>>>>   var whole_array = read_2D_array(inFile, matrixSize);
>>>>
>>>>   var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
>>>>   var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
>>>>   var lkji_tiles: [ijk_space] Tile;
>>>>
>>>>   // build each lower-triangular tile on the locale that owns it
>>>>   for i in zero..numTiles-1 do {
>>>>     for j in zero..i do {
>>>>       on lkji_tiles(i,j,zero).locale do {
>>>>         var curr_tile: Tile = new Tile();
>>>>         for (ii,jj) in tile_array_indices do {
>>>>           curr_tile.tile_array(ii,jj) = whole_array(i*tileSize+ii,j*tileSize+jj);
>>>>         }
>>>>         lkji_tiles(i,j,zero) = curr_tile;
>>>>       }
>>>>     }
>>>>   }
>>>>   var invalid : bool = false;
>>>>   for i in zero..numTiles-1 do {
>>>>     for iB in zero..tileSize-1 do {
>>>>       for j in zero..i do {
>>>>         // whole-array copy (bulk transfer), then compare element by element
>>>>         var temp = lkji_tiles(i,j,zero).tile_array;
>>>>         if(i != j) {
>>>>           for jB in zero..tileSize-1 do {
>>>>             if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
>>>>               invalid = true;
>>>>             }
>>>>           }
>>>>         } else {
>>>>           for jB in zero..iB do {
>>>>             if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
>>>>               invalid = true;
>>>>             }
>>>>           }
>>>>         }
>>>>       }
>>>>     }
>>>>   }
>>>>   writeln("INVALID? : ", invalid);
>>>>
>>>> }
>>>> On Jan 22, 2014, at 1:46 PM, Akihiro Hayashi wrote:
>>>>
>>>>> Hi Rafael,
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> I inlined my comments below:
>>>>>
>>>>>> May we have a simplified copy of your code (kinda the snippet provided
>>>>>> below but with initial values for tileSize, numTiles_2, k, etc. i.e.
>>>>>> something that compiles) so that we can also give it a go?
>>>>> Yes, it would be better if we had a simplified code.
>>>>> Actually, I have been trying for several weeks to make a simple code that
>>>>> reproduces this problem. I finally managed to make it this morning.
>>>>> Let me ask my advisor if we can show you the code.
>>>>>
>>>>>> Would you like to try also with these flags?:
>>>>>>
>>>>>> -suseBulkTransferStride=true -suseBulkTransfer=false
>>>>> I tried these flags, but I still get the error.
>>>>>
>>>>> I'll keep you updated.
>>>>>
>>>>> Best,
>>>>>
>>>>> Akihiro
>>>>>
>>>>> On Jan 22, 2014, at 5:23 AM, Rafael Asenjo Plaza wrote:
>>>>>
>>>>>> Hi Akihiro,
>>>>>>
>>>>>> May we have a simplified copy of your code (kinda the snippet provided
>>>>>> below but with initial values for tileSize, numTiles_2, k, etc. i.e.
>>>>>> something that compiles) so that we can also give it a go?
>>>>>>
>>>>>> Would you like to try also with these flags?:
>>>>>>
>>>>>> -suseBulkTransferStride=true -suseBulkTransfer=false
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Rafa.
>>>>>>
>>>>>> On 21/01/2014, at 18:33, Akihiro Hayashi <[email protected]> wrote:
>>>>>>
>>>>>>> Dear Chapel developers,
>>>>>>>
>>>>>>> This is Akihiro Hayashi, postdoc at Rice University.
>>>>>>> I'm writing to ask about an array copy failure in Chapel.
>>>>>>>
>>>>>>> I'm trying to evaluate some Chapel benchmarks across multiple nodes, but
>>>>>>> I get a strange error.
>>>>>>> Please note that I'm using an old version of the Chapel compiler (r21945) with
>>>>>>> qthread-1.10 and GASNet-1.20.2 (infiniband-conduit, mpi-spawner) because
>>>>>>> the latest version does not work.
>>>>>>> With the latest version of the Chapel compiler (r22568) with qthread-1.10
>>>>>>> and GASNet-1.22.0 (infiniband-conduit, mpi-spawner), I get a SEGV when
>>>>>>> running a simple program (coforall loc in Locales do on loc {
>>>>>>> writeln(loc); }) across multiple nodes with the mpi spawner.
>>>>>>> This is a separate problem that I have not investigated yet.
>>>>>>> I'll work on it later.
>>>>>>>
>>>>>>> The following problem might be fixed in the latest version, but any
>>>>>>> comments and suggestions are appreciated.
>>>>>>> Here is part of my code.
>>>>>>> The main data structure is a 3-dimensional array, declared as a
>>>>>>> distributed array in which each element refers to a 2-dimensional
>>>>>>> array.
>>>>>>> You can see the array copy statement (liBlock =
>>>>>>> lkji_tiles(k,k,k+1).tile_array;) in Line 11. I want to use this copy
>>>>>>> statement because the Chapel compiler generates bulk transfer code for it,
>>>>>>> which accelerates program execution.
>>>>>>>
>>>>>>> // Code
>>>>>>> 1: const zero: int(32) = 0;
>>>>>>> 2: var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
>>>>>>> 3: class Tile {
>>>>>>> 4:   var tile_array: [tile_array_indices] real;
>>>>>>> 5: }
>>>>>>> 6: var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
>>>>>>> 7: var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
>>>>>>> 8: var lkji_tiles: [ijk_space] Tile;
>>>>>>> ...
>>>>>>> 9: begin {
>>>>>>> ...
>>>>>>> 10:   var liBlock: [tile_array_indices] real;
>>>>>>> 11:   liBlock = lkji_tiles(k,k,k+1).tile_array;
>>>>>>> 12:   for (m,n) in tile_array_indices {
>>>>>>> 13:     if (liBlock(m,n) != lkji_tiles(k,k,k+1).tile_array(m,n)) {
>>>>>>> 14:       invalid = true;
>>>>>>> 15:     }
>>>>>>> 16:   }
>>>>>>> 17:   if (invalid) { writeln("Copy Failed");}
>>>>>>> 18:   ...
>>>>>>> 19: }
>>>>>>> ...
>>>>>>>
>>>>>>> In my experiment, when running the program on 2 or more locales, the
>>>>>>> program prints "Copy Failed", which means "liBlock =
>>>>>>> lkji_tiles(k,k,k+1).tile_array;" in Line 11 failed.
>>>>>>> This happens sometimes (not always), and I confirmed the copy is
>>>>>>> done successfully if I replace the array copy in Line 11 with a copy loop.
>>>>>>> Additionally, I see the same behavior when I replace the array
>>>>>>> copy in Line 11 with
>>>>>>> liBlock._value.doiBulkTransfer(lkji_tiles(k,k,k+1).tile_array);.
>>>>>>>
>>>>>>> Here is the output log at runtime when I compile the program with -s
>>>>>>> debugBulkTransfer (tileSize=200):
>>>>>>>
>>>>>>> -- Log starts here
>>>>>>> In DefaultRectangularArr.doiBulkTransfer(): Alo=(0, 0), Blo=(0, 0), len=40000, elemSize=8;
>>>>>>> -- End of Log
>>>>>>>
>>>>>>> In both cases, the runtime internally calls the chpl_comm_get API(*) and
>>>>>>> the API takes the above parameters.
>>>>>>> They look correct to me.
>>>>>>> (*) Please take a look at the doiBulkTransfer function in
>>>>>>> CHPL_HOME/modules/internal/DefaultRectangular.chpl
>>>>>>>
>>>>>>> Any comments and suggestions are appreciated.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Akihiro
>>>>>>> _______________________________________________
>>>>>>> Chapel-developers mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/chapel-developers
>>>>>>
>>>>>> __
>>>>>> Rafael Asenjo Plaza
>>>>>> Dept. Arquitectura de Computadores
>>>>>> Complejo Tecnologico Campus de Teatinos
>>>>>> E-29071 MALAGA (SPAIN)
>>>>>> Tel: +34 95 213 27 91
>>>>>> Fax: +34 95 213 27 90
>>>>>> http://www.ac.uma.es/~asenjo
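
Note on workaround 2 in Rafael's list (splitting a big bulk copy into smaller ones): the idea could be sketched roughly as below. This is only a minimal illustration, not code from this thread. The helper name chunkedCopy and the rowsPerChunk knob are hypothetical, the "ref" intent assumes a Chapel version that accepts it on array formals, and whether each slice assignment really takes the bulk-transfer path and stays under the failing size on ibv-conduit is an assumption that has not been verified.

```chapel
config const tileSize = 200;     // tile edge length used in the report above
config const rowsPerChunk = 50;  // hypothetical knob: rows moved per assignment (must be > 0)

const tileDom = {0..tileSize-1, 0..tileSize-1};

// Hypothetical helper: copy src into dst one band of rows at a time, so that
// each array assignment moves at most rowsPerChunk * tileSize elements.
proc chunkedCopy(ref dst: [tileDom] real, src: [tileDom] real) {
  for lo in 0..tileSize-1 by rowsPerChunk {
    const hi = min(lo + rowsPerChunk - 1, tileSize - 1);
    dst[lo..hi, 0..tileSize-1] = src[lo..hi, 0..tileSize-1];
  }
}

proc main() {
  var a, b: [tileDom] real;
  for (i, j) in tileDom do
    a[i, j] = i * tileSize + j;

  chunkedCopy(b, a);

  // Same element-wise check as in the reproducer.
  var invalid = false;
  for ij in tileDom do
    if a[ij] != b[ij] then invalid = true;
  writeln("INVALID? : ", invalid);
}
```

In the original snippet, the whole-array copy in Line 11 would become something like chunkedCopy(liBlock, lkji_tiles(k,k,k+1).tile_array), with rowsPerChunk tuned so that a single band stays well below the transfer size at which the failure appears.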
