Hi Konstantina -

(Sending again, this time CC'ing the mailing list.)

On 6/6/15, 11:38 AM, "Panagiotopoulou, Konstantina" <[email protected]> wrote:

>B. When using serial local execution (chpl_ftable_call (..))
>I get :
>(PROGRAM OUTPUT) arg= 1----------on locale 0  //execution of foo()
>function
>(MODULE) SUB:: 0 :: started
>(MODULE) SUB:: 0 :: in  e!=nil
>(COMM LAYER) 0 chpl_comm_fork::  Loc 0 -> Loc 457826136
>.... 
>[0] /usr/bin/gstack 20149
>
>Now since the parent of locale 1 is locale 0 (here) I would expect that
>the wide_endCount pointer would point to the endCount in local memory.
>Instead it points to 457826136 (corrupted memory I suppose) , reads this
>as a locale ID and tries to do a remote sub on that. Eventually it gives
>a seg fault.
>
>So I am confused. 
>Shouldn't locale 0 be able to read the correct wide_EndCount pointer
>since:
>1. it is copied from the args sent to locale 1 and
>2. the endCount lives in local memory
>??


> In the beginning, I thought that the .locale part of the wide__Endcount
> points to the child locale (rather than the parent). In my case, the
> child has failed, so I thought that I could access it and instead write
> the current locale's id - the locale that performs the recovery.
> Otherwise, I don't really need to call __primitive("get end count").

> ftable_call is the naive serial recovery. It seems that this one has
> the least problems. The task executes normally and since it is serial
> I use the atomic_counter from the RTS to decrement task count
> without problems, but the main waits on the remote task on the
> failed locale to complete, until gasnet timeout occurs.



To be clear, I believe that these endCounts are allocated on
the parent, decremented by the child, and waited on in the parent.

So, in your example, I'd expect that you could move the work that was
assigned to Locale 1, but that you'd have to leave the end count where
it is on Locale 0, since the parent task there is waiting on it. If you
decremented it early (because Locale 1 failed), you'd cause the program
on Locale 0 to continue even though its child task was not complete,
which probably isn't what you want if you're transparently redirecting
the failed work.
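
As a rough illustration of that protocol, here is a toy Python model
(not Chapel runtime code). EndCount, run_with_recovery, and the
failed_locales set are made-up names for this sketch, and threads stand
in for locales. The point is only that the count stays on the parent's
locale and is decremented after the redirected work has really finished:

```python
import threading


class EndCount:
    """Toy stand-in for an end count: it lives with the parent, which waits on it."""

    def __init__(self, home_locale):
        # Locale of the waiting parent. Recovery never rewrites this field.
        self.home_locale = home_locale
        self._count = 0
        self._cond = threading.Condition()

    def add(self, n=1):
        # Parent registers n outstanding child tasks.
        with self._cond:
            self._count += n

    def done(self):
        # Child (or whoever ran the work on its behalf) reports completion.
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait(self):
        # Parent blocks until every registered child has reported completion.
        with self._cond:
            self._cond.wait_for(lambda: self._count == 0)


def run_with_recovery(work, arg, target_locale, failed_locales, end_count, log):
    # Hypothetical recovery policy: if the target locale has failed,
    # run the work on the end count's home locale instead.
    if target_locale in failed_locales:
        locale = end_count.home_locale
    else:
        locale = target_locale
    log.append((locale, work(arg)))
    # Decrement only after the work is complete, never early.
    end_count.done()


def demo():
    log = []
    ec = EndCount(home_locale=0)   # allocated on the parent (locale 0)
    ec.add()                       # one child task outstanding
    child = threading.Thread(
        target=run_with_recovery,
        args=(lambda x: x * 2, 21, 1, {1}, ec, log),  # locale 1 has "failed"
    )
    child.start()
    ec.wait()                      # parent resumes only once the work is done
    child.join()
    return log, ec.home_locale
```

In this model the parent wakes up only after the redirected task has
recorded its result, and home_locale is still 0 afterwards.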

I wouldn't try to change the locale portion of the end count
pointer, since there's probably a task waiting on it wherever
the end count is stored.

Cheers,

-michael



------------------------------------------------------------------------------
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
