Hi Michael,

________________________________________
From: Michael Ferguson [[email protected]]
Sent: 08 June 2015 16:55
To: Panagiotopoulou, Konstantina
Cc: [email protected]
Subject: Re: [Chapel-developers] chpl_wide_EndCount

Hi Konstantina -

(sending again this time CC'ing the mailing list)

On 6/6/15, 11:38 AM, "Panagiotopoulou, Konstantina" <[email protected]> wrote:

>B. When using serial local execution (chpl_ftable_call (..))
>I get :
>(PROGRAM OUTPUT) arg= 1----------on locale 0  //execution of foo()
>function
>(MODULE) SUB:: 0 :: started
>(MODULE) SUB:: 0 :: in  e!=nil
>(COMM LAYER) 0 chpl_comm_fork::  Loc 0 -> Loc 457826136
>....
>[0] /usr/bin/gstack 20149
>
>Now since the parent of locale 1 is locale 0 (here) I would expect that
>the wide_endCount pointer would point to the endCount in local memory.
>Instead it points to 457826136 (corrupted memory I suppose) , reads this
>as a locale ID and tries to do a remote sub on that. Eventually it gives
>a seg fault.
>
>So I am confused.
>Shouldn't locale 0 be able to read the correct wide_EndCount pointer
>since:
>1. it is copied from the args sent to locale 1 and
>2. the endCount lives in local memory
>??


> In the beginning, I thought that the .locale part of the wide__Endcount
> points to the child locale (rather than the parent). In my case, the
> child has failed, so I thought that I could access it and instead write
> the current locale's id - the locale that performs the recovery.
> Otherwise, I don't really need to call __primitive("get end count").

> ftable_call is the naive serial recovery. It seems that this one has
> the fewest problems. The task executes normally and, since it is serial,
> I use the atomic_counter from the RTS to decrement the task count
> without problems, but the main task waits for the remote task on the
> failed locale to complete, until the gasnet timeout occurs.



To be clear -  I believe that these endCounts are allocated on
the parent, decremented on the child, and waited for in the parent.
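
A minimal sketch of that lifecycle, modelled in Python with threads
rather than in Chapel. The EndCount class and run_parent function are
hypothetical names invented for this illustration, not the Chapel
runtime's data structures; the only point is who allocates the counter,
who decrements it, and who waits on it.

```python
import threading

class EndCount:
    """Toy stand-in for an end count: a counter the parent can wait on."""
    def __init__(self):
        self._n = 0
        self._cv = threading.Condition()

    def up(self):
        with self._cv:
            self._n += 1

    def down(self):
        with self._cv:
            self._n -= 1
            if self._n == 0:
                self._cv.notify_all()

    def wait(self):
        with self._cv:
            self._cv.wait_for(lambda: self._n == 0)

def run_parent():
    events = []
    end_count = EndCount()   # allocated by the parent
    end_count.up()           # parent registers one child task

    def child():
        events.append("child work")
        end_count.down()     # child decrements the parent's counter

    t = threading.Thread(target=child)
    t.start()
    end_count.wait()         # parent blocks until the count reaches zero
    events.append("parent continues")
    t.join()
    return events
```

In this model the parent cannot get past wait() until the child has
called down(), so the counter has to stay reachable by whichever task
ends up doing the child's work.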

So, in your example, I'd expect you could move the work on Locale 1,
but that you'd have to preserve the end count where it is on Locale 0
since that task is waiting for it. If you decremented it early
(because Locale 1 failed) - you'd cause the program on Locale 0 to
continue even though its child task was not complete, which probably
isn't what you wanted (if you're transparently redirecting the failed
work).
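
The redirect case can be modelled the same way. The following is only
a sketch under assumptions: run_with_redirect, the locale numbers and
the RuntimeError used to simulate a failed locale are all made up for
the illustration, and nothing here is the Chapel runtime's recovery
code. It shows the ordering argued for above: the work moves, the
counter stays with the parent, and the decrement happens only once the
redirected work is done.

```python
import threading

def run_with_redirect(failed_locale=1, fallback_locale=0):
    log = []
    pending = [1]                  # end count, owned by the waiting parent
    cv = threading.Condition()

    def down_end_count():
        with cv:
            pending[0] -= 1
            cv.notify_all()

    def task(locale_id):
        # Simulated failure: the task cannot run on the failed locale.
        if locale_id == failed_locale:
            raise RuntimeError("locale failed")
        log.append(f"work done on locale {locale_id}")

    def launch():
        try:
            task(failed_locale)
        except RuntimeError:
            log.append(f"locale {failed_locale} failed, redirecting")
            task(fallback_locale)  # redo the work elsewhere
        down_end_count()           # decrement only after the work is complete

    t = threading.Thread(target=launch)
    t.start()
    with cv:
        cv.wait_for(lambda: pending[0] == 0)   # parent waits on its own counter
    log.append("parent continues")
    t.join()
    return log
```

If down_end_count() were called at the moment the failure is detected
instead, "parent continues" could be logged before the redirected work
had finished.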


That's what I am thinking too. Since there is a pointer (initially on
locale 1 and then on the locale that the task is redirected to), I was
hoping that the remote on would work fine.
I even tried to add an explicit on in the downEndCount function, following the
one in upEndCount. Something like:

if (e != nil) then
    on e { e.i.sub(...); }

but this also launches on a "wrong" locale id and gives a segfault after a while.

I wouldn't try to change the locale portion of the end count
pointer, since there's probably a task waiting on it wherever
the end count is stored.

Yeah, I have started thinking of alternatives.  

Cheers,

-michael


Cheers,

Konstantina






------------------------------------------------------------------------------
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
