Hi, Brad,

Thank you for your suggestions. 

> I'm always glad when we can pass the blame to something other than Chapel.  :)
Exactly :)

I tried the --enable-debug flag at GASNet configuration time and rebuilt my 
Chapel compiler and runtime.
Then I compiled and ran the program with GASNET_BACKTRACE=1, but I don't get 
any message. 
Are debug messages supposed to be shown during program execution if we 
configure GASNet with the --enable-debug flag?
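
In case it helps with a report to the GASNet team, here is a rough check of how large one bulk transfer in my reproducer is, taken from the debugBulkTransfer log quoted further down (len=40000, elemSize=8, tileSize=200). The 128 MB figure is the turning point Rafael Larrosa saw in his own code; reading it as 128 * 2**20 bytes is my assumption, and the variable names are only for illustration.

```python
# Size of one tile transfer, from the debugBulkTransfer log (tileSize=200).
tile_size = 200                    # elements per tile dimension
length = tile_size * tile_size     # len=40000 in the log
elem_size = 8                      # elemSize=8 in the log
transfer_bytes = length * elem_size

# Assumption: "128MBytes" is read as 128 * 2**20 bytes.
threshold_bytes = 128 * 2**20

print(transfer_bytes)                      # 320000
print(transfer_bytes < threshold_bytes)    # True
print(threshold_bytes // transfer_bytes)   # 419
```

So a single tile copy is about 320 KB, roughly 400 times smaller than that turning point, which suggests the failure in my case is not simply one oversized message.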

Please let me know if you have any suggestions.

Best,

Akihiro
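
P.S. Workaround 2 in Rafa's list quoted below (splitting big messages into smaller ones) could look roughly like the following. This is only a sketch of the idea in Python, not Chapel runtime code: split_transfer and chunked_copy are hypothetical names, and the slice assignment stands in for whatever remote get the runtime would actually issue.

```python
def split_transfer(total_bytes, max_chunk_bytes):
    """Return (offset, size) pairs that cover total_bytes in order."""
    if total_bytes < 0 or max_chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and max_chunk_bytes > 0")
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(max_chunk_bytes, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


def chunked_copy(dst, src, max_chunk_bytes):
    """Copy src into dst one chunk at a time.

    The slice assignment is a stand-in for a real remote get; it is not
    how the Chapel runtime performs the transfer.
    """
    if len(dst) != len(src):
        raise ValueError("dst and src must have the same length")
    for offset, size in split_transfer(len(src), max_chunk_bytes):
        dst[offset:offset + size] = src[offset:offset + size]


if __name__ == "__main__":
    src = bytearray(range(256)) * 10   # 2560 bytes of sample data
    dst = bytearray(len(src))
    chunked_copy(dst, src, max_chunk_bytes=1000)
    print(split_transfer(len(src), 1000))
    print(dst == src)
```

The chunk limit would have to be chosen below whatever size starts to fail on ibv-conduit, which we do not know yet.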

On Jan 29, 2014, at 2:40 PM, Brad Chamberlain wrote:

> 
> Hi guys --
> 
> While I'm sorry about the time spent on this issue, I'm always glad when we 
> can pass the blame to something other than Chapel.  :)
> 
> Something I'm wondering is whether, if GASNet is built with runtime checks on 
> (I think this is done using --enable-debug at GASNet configuration time), 
> will the ibv-conduit issue show up as a runtime assertion or something other 
> than a bug?  I think we may want to pass along a report of this issue to the 
> GASNet team, in which case this is undoubtedly the first question they'll ask.
> 
> It would also be great if we had a standalone C+GASNet program that exhibited 
> the issue (as they're not deeply invested in Chapel), but if that isn't a 
> 15-minute exercise, we can tell them how to reproduce in Chapel.
> 
> Thanks,
> -Brad
> 
> 
> On Wed, 29 Jan 2014, Akihiro Hayashi wrote:
> 
>> Hi, Rafael,
>> 
>> Thanks for your reply. I inlined my comments below:
>> 
>>> I’ve been talking to Rafael Larrosa regarding the issue you are reporting. 
>>> He has conducted some experiments with your code on Titan (Cray XK7 at 
>>> ORNL, which features the Cray Gemini interconnect) and there is no 
>>> communication problem there. However, we have here in Malaga a cluster 
>>> based on Infiniband (ibv-conduit) and the execution fails on that platform. 
>>> I’ve also confirmed that udp-conduit does not pose any problem.
>> I really appreciate your and Rafael Larrosa's experiments. I'm glad to hear 
>> that this is not a problem in the Chapel compiler.
>> 
>>> Rafael Larrosa told me that he faced the same issue last month and that 
>>> after spending more than two weeks tackling the problem he is almost sure 
>>> that there is a bug (or a maximum buffer size limitation) in the 
>>> ibv-conduit implementation of gasnet. For his code, he found a turning 
>>> point when the transferred buffer was 128MBytes: smaller communications 
>>> work fine, but larger always fail. He says it was tricky, because when you 
>>> try to isolate the problem (i.e. isolate the particular transfer that fails 
>>> by executing just this single communication) then the problem vanishes. So it 
>>> will be challenging to chase this bug.
>> 
>> Sure, now I understand that there is a bug in the ibv-conduit implementation 
>> of GASNet, and yes, it seems that fixing the bug is very difficult. Actually, 
>> the original benchmark I want to run spawns many tasks with begin statements, 
>> and each task does a bulk transfer. I can imagine the benchmark exceeding some 
>> limit, like Rafael's code. I'm also wondering why the simplified code has this 
>> problem. There might be another problem.
>> 
>>> You may want to circumvent the bug by:
>>> 
>>> 1.- Not using bulkComms optimization (-suseBulkTransferStride=false 
>>> -suseBulkTransfer=false). —> Slower comms.
>>> 2.- Implementing a version of bulkComms that splits big messages into 
>>> smaller ones. —> wearisome tinkering
>>> 3.- Avoid ibv-conduit —> 207 out of the 500 supercomputers in latest top500 
>>> list are based on ibv
>>> 4.- Dive into the ibv-conduit implementation —> Probably not your main 
>>> research goal
>>> 
>>> For the time being we are conducting all our experiments on Cray machines, 
>>> so we do not plan (and do not have time) to tackle 2 or 4, so we are 
>>> getting by with 3.
>> Exactly, 4 is not my research goal. I'd choose 3 if a benchmark I would 
>> like to run uses bulk transfer. Thanks for your suggestions.
>> 
>>> If Rafael wants to chime in, he can probably give you more details and 
>>> advice, should you want to debug your code at a lower level.
>> 
>> I would appreciate it if he could give me more details. I think I should 
>> mention the bug in my paper or something.
>> 
>> Best,
>> 
>> Akihiro
>> 
>> On Jan 29, 2014, at 4:46 AM, Rafael Asenjo Plaza wrote:
>> 
>>> Hi Akihiro,
>>> 
>>> I’ve been talking to Rafael Larrosa regarding the issue you are reporting. 
>>> He has conducted some experiments with your code on Titan (Cray XK7 at 
>>> ORNL, which features the Cray Gemini interconnect) and there is no 
>>> communication problem there. However, we have here in Malaga a cluster 
>>> based on Infiniband (ibv-conduit) and the execution fails on that platform. 
>>> I’ve also confirmed that udp-conduit does not pose any problem.
>>> 
>>> Rafael Larrosa told me that he faced the same issue last month and that 
>>> after spending more than two weeks tackling the problem he is almost sure 
>>> that there is a bug (or a maximum buffer size limitation) in the 
>>> ibv-conduit implementation of gasnet. For his code, he found a turning 
>>> point when the transferred buffer was 128MBytes: smaller communications 
>>> work fine, but larger always fail. He says it was tricky, because when you 
>>> try to isolate the problem (i.e. isolate the particular transfer that fails 
>>> by executing just this single communication) then the problem vanishes. So it 
>>> will be challenging to chase this bug.
>>> 
>>> You may want to circumvent the bug by:
>>> 
>>> 1.- Not using bulkComms optimization (-suseBulkTransferStride=false 
>>> -suseBulkTransfer=false). —> Slower comms.
>>> 2.- Implementing a version of bulkComms that splits big messages into 
>>> smaller ones. —> wearisome tinkering
>>> 3.- Avoid ibv-conduit —> 207 out of the 500 supercomputers in latest top500 
>>> list are based on ibv
>>> 4.- Dive into the ibv-conduit implementation —> Probably not your main 
>>> research goal
>>> 
>>> For the time being we are conducting all our experiments on Cray machines, 
>>> so we do not plan (and do not have time) to tackle 2 or 4, so we are 
>>> getting by with 3.
>>> 
>>> If Rafael wants to chime in, he can probably give you more details and 
>>> advice, should you want to debug your code at a lower level.
>>> 
>>> Regards,
>>> 
>>> Rafa.
>>> 
>>> On 28/01/2014, at 19:31, Akihiro Hayashi <[email protected]> wrote:
>>> 
>>>> Hi, Rafael,
>>>> 
>>>> Sorry for the delayed reply.
>>>> Let me share the program that reproduces the problem. (attached below)
>>>> 
>>>> As you can see, the program prints "INVALID? : true" if we get a bulk copy 
>>>> transfer error; otherwise it prints "INVALID? : false".
>>>> I get the error when I run the program on 2 locales with ibv-conduit 
>>>> (mpi-spawner). The input data size is matrixSize = 2000 and tileSize = 
>>>> 200. Please let me know if you want the input file.
>>>> Note that I don't get the error when I run the program on 1 locale. In 
>>>> addition, I don't get the error with a smaller data size even on 2 or more 
>>>> locales (e.g. a 10x10 matrix and a 2x2 tile size).
>>>> My guess is that using ibv-conduit and transferring a certain amount of 
>>>> data triggers this problem.
>>>> FYI, using udp-conduit (amudprun) does not show the error.
>>>> 
>>>> Please let me know if you have any comments or questions.
>>>> 
>>>> Best,
>>>> 
>>>> Akihiro
>>>> 
>>>> --
>>>> 
>>>> use BlockDist;
>>>> 
>>>> config const matrixSize: int(32) = -1;
>>>> config const   tileSize: int(32) = -1;
>>>> config const     inFile: string = "m_2000.in";
>>>> const zero: int(32) = 0;
>>>> var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
>>>> 
>>>> class Tile {
>>>> var tile_array: [tile_array_indices] real;
>>>> }
>>>> 
>>>> proc read_2D_array ( fileName: string, matrixSize: int(32) ) {
>>>> var input_stream = open (fileName, iomode.r);
>>>> var reader = input_stream.reader();
>>>> var matrix_index_2D = {0..matrixSize-1, 0..matrixSize-1};
>>>> var array: [matrix_index_2D] real;
>>>> 
>>>> for ij in matrix_index_2D do {
>>>>     reader.read(array(ij));
>>>> }
>>>> input_stream.close();
>>>> reader.close();
>>>> // if (debug) { writeln("whole array: ",array); }
>>>> return array;
>>>> }
>>>> 
>>>> proc main(): void {
>>>> writeln("numLocales : ", numLocales);
>>>> 
>>>> var numTiles: int(32) = matrixSize/tileSize;
>>>> var numTiles_2: int(64) = matrixSize/tileSize;
>>>> 
>>>> var whole_array = read_2D_array(inFile, matrixSize);
>>>> 
>>>> var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
>>>> var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
>>>> var lkji_tiles: [ijk_space] Tile;
>>>> 
>>>> for i in zero..numTiles-1 do {
>>>>     for j in zero..i do {
>>>>         on lkji_tiles(i,j,zero).locale do {
>>>>             var curr_tile: Tile = new Tile();
>>>>            for (ii,jj) in tile_array_indices do {
>>>>                 curr_tile.tile_array(ii,jj) = whole_array(i*tileSize+ii,j*tileSize+jj);
>>>>            }
>>>>             lkji_tiles(i,j,zero) = curr_tile;
>>>>         }
>>>>    }
>>>> }
>>>> var invalid : bool = false;
>>>> for i in zero..numTiles-1 do {
>>>>    for iB in zero..tileSize-1 do {
>>>>         for j in zero..i do {
>>>>            var temp = lkji_tiles(i,j,zero).tile_array;
>>>>            if(i != j) {
>>>>                 for jB in zero..tileSize-1 do {
>>>>                    if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
>>>>                         invalid = true;
>>>>                    }
>>>>                 }
>>>>            } else {
>>>>                 for jB in zero..iB do {
>>>>                    if (temp(iB,jB) != lkji_tiles(i, j, zero).tile_array(iB, jB)) {
>>>>                         invalid = true;
>>>>                    }
>>>>                 }
>>>>            }
>>>>         }
>>>>    }
>>>> }
>>>> writeln("INVALID? : ", invalid);
>>>> 
>>>> }
>>>> On Jan 22, 2014, at 1:46 PM, Akihiro Hayashi wrote:
>>>> 
>>>>> Hi Rafael,
>>>>> 
>>>>> Thanks for your reply.
>>>>> 
>>>>> I inlined my comments below:
>>>>> 
>>>>>> May we have a simplified copy of your code (kinda the snippet provided 
>>>>>> below but with initial values for tileSize, numTiles_2, k, etc. i.e. 
>>>>>> something that compiles) so that we can also give it a go?
>>>>> Yes, it would be better to have simplified code.
>>>>> Actually, I have been trying to write a simple program that reproduces 
>>>>> this problem for several weeks, and I finally managed it this morning.
>>>>> Let me ask my advisor if we can show you the code.
>>>>> 
>>>>>> Would you like to try also with these flags?:
>>>>>> 
>>>>>> -suseBulkTransferStride=true -suseBulkTransfer=false
>>>>> I tried these flags, but I still get the error.
>>>>> 
>>>>> I'll keep you updated.
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Akihiro
>>>>> 
>>>>> On Jan 22, 2014, at 5:23 AM, Rafael Asenjo Plaza wrote:
>>>>> 
>>>>>> Hi Akihiro,
>>>>>> 
>>>>>> May we have a simplified copy of your code (kinda the snippet provided 
>>>>>> below but with initial values for tileSize, numTiles_2, k, etc. i.e. 
>>>>>> something that compiles) so that we can also give it a go?
>>>>>> 
>>>>>> Would you like to try also with these flags?:
>>>>>> 
>>>>>> -suseBulkTransferStride=true -suseBulkTransfer=false
>>>>>> 
>>>>>> Thank you,
>>>>>> 
>>>>>> Rafa.
>>>>>> 
>>>>>> On 21/01/2014, at 18:33, Akihiro Hayashi <[email protected]> wrote:
>>>>>> 
>>>>>>> Dear Chapel developers,
>>>>>>> 
>>>>>>> This is Akihiro Hayashi, a postdoc at Rice University.
>>>>>>> I'm writing to ask about an array copy failure in Chapel.
>>>>>>> 
>>>>>>> I'm trying to evaluate some Chapel benchmarks across multiple nodes, but 
>>>>>>> I get a strange error.
>>>>>>> Please note that I'm using an old version of the Chapel compiler (r21945) 
>>>>>>> with qthread-1.10 and GASNet-1.20.2 (infiniband-conduit, mpi-spawner) 
>>>>>>> because the latest version does not work.
>>>>>>> With the latest version of the Chapel compiler (r22568) with qthread-1.10 
>>>>>>> and GASNet-1.22.0 (infiniband-conduit, mpi-spawner), I get a SEGV when 
>>>>>>> running a simple program (coforall loc in Locales do on loc { 
>>>>>>> writeln(loc); }) across multiple nodes with the mpi spawner.
>>>>>>> This is a separate problem that I have not investigated yet. 
>>>>>>> I'll work on it later.
>>>>>>> 
>>>>>>> The following problem might be fixed in the latest version, but any 
>>>>>>> comments and suggestions are appreciated.
>>>>>>> Here is part of my code.
>>>>>>> The main data structure is a 3-dimensional distributed array, each 
>>>>>>> element of which refers to a 2-dimensional array.
>>>>>>> You can see the array copy statement (liBlock = 
>>>>>>> lkji_tiles(k,k,k+1).tile_array;) in Line 11. I want to use this copy 
>>>>>>> statement because the Chapel compiler generates bulk transfer code for it, 
>>>>>>> which accelerates program execution.
>>>>>>> 
>>>>>>> // Code
>>>>>>> 1: const zero: int(32) = 0;
>>>>>>> 2: var tile_array_indices = {zero..tileSize-1,zero..tileSize-1};
>>>>>>> 3: class Tile {
>>>>>>> 4:    var tile_array: [tile_array_indices] real;
>>>>>>> 5: }
>>>>>>> 6: var proto_ijk_space = {zero..numTiles_2-1, zero..numTiles_2, zero..numTiles_2};
>>>>>>> 7: var ijk_space = proto_ijk_space dmapped Block(boundingBox=proto_ijk_space);
>>>>>>> 8: var lkji_tiles: [ijk_space] Tile;
>>>>>>> ...
>>>>>>> 9: begin {
>>>>>>> ...
>>>>>>> 10:     var liBlock: [tile_array_indices] real;
>>>>>>> 11:     liBlock = lkji_tiles(k,k,k+1).tile_array;
>>>>>>> 12:     for (m,n) in tile_array_indices {
>>>>>>> 13:     if (liBlock(m,n) != lkji_tiles(k,k,k+1).tile_array(m,n)) {
>>>>>>> 14:        invalid = true;
>>>>>>> 15:     }
>>>>>>> 16:   }
>>>>>>> 17:   if (invalid) { writeln("Copy Failed");}
>>>>>>> 18:   ...
>>>>>>> 19: }
>>>>>>> ...
>>>>>>> 
>>>>>>> In my experiment, when running the program on 2 or more locales, the 
>>>>>>> program prints "Copy Failed", which means "liBlock = 
>>>>>>> lkji_tiles(k,k,k+1).tile_array;" in Line 11 failed.
>>>>>>> This happens sometimes (not always), and I confirmed that the copy 
>>>>>>> succeeds if I replace the array copy in Line 11 with a copy loop.
>>>>>>> Additionally, I see the same failure when I replace the array 
>>>>>>> copy in Line 11 with 
>>>>>>> liBlock._value.doiBulkTransfer(lkji_tiles(k,k,k+1).tile_array);.
>>>>>>> 
>>>>>>> Here is an output log at runtime when I compile the program with -s 
>>>>>>> debugBulkTransfer (tileSize=200):
>>>>>>> 
>>>>>>> -- Log starts here
>>>>>>> In DefaultRectangularArr.doiBulkTransfer(): Alo=(0, 0), Blo=(0, 0), len=40000, elemSize=8;
>>>>>>> -- End of Log
>>>>>>> 
>>>>>>> In both cases, the runtime internally calls the chpl_comm_get API(*), 
>>>>>>> which takes the above parameters.
>>>>>>> The parameters look correct to me.
>>>>>>> (*) Please take a look at doiBulkTransfer function in 
>>>>>>> CHPL_HOME/modules/internal/DefaultRectangular.chpl
>>>>>>> 
>>>>>>> Any comments and suggestions are appreciated.
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> 
>>>>>>> Akihiro
>>>>>>> _______________________________________________
>>>>>>> Chapel-developers mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/chapel-developers
>>>>>> 
>>>>>> __
>>>>>> Rafael Asenjo Plaza
>>>>>> Dept. Arquitectura de Computadores
>>>>>> Complejo Tecnologico Campus de Teatinos
>>>>>> E-29071 MALAGA (SPAIN)
>>>>>> Tel: +34 95 213 27 91
>>>>>> Fax: +34 95 213 27 90
>>>>>> http://www.ac.uma.es/~asenjo
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> __
>>> Rafael Asenjo Plaza
>>> Dept. Arquitectura de Computadores
>>> Complejo Tecnologico Campus de Teatinos
>>> E-29071 MALAGA (SPAIN)
>>> Tel: +34 95 213 27 91
>>> Fax: +34 95 213 27 90
>>> http://www.ac.uma.es/~asenjo
>>> 
>>> 
>> 
>> 


