Meta: I've been meaning to respond to this thread, but can't find the time. What's the time-frame for implementing this? If it's hypothetical at the moment and just a question of getting things spec'ed, one could perhaps look at discussing it at the next Cython workshop, or perhaps a Skype call with the three of us at some point...

Regarding the tasks: One of my biggest problems with Python is the lack of an elegant syntax for anonymous functions. But since Python has that problem, I feel it is not necessarily something we should fix (by using with statements to create tasks). Sometimes Pythonic-ness is more important than elegance (for Cython).

In general I'm happy as long as there's a chance of getting things to work in pure Python mode as well (with serial execution). So if, e.g., with statements creating tasks have the same effect when running the same code (serially) in pure Python, I'm less opposed (I haven't looked at it in detail).

Dag Sverre

On 10/19/2011 09:53 PM, mark florisson wrote:
On 19 October 2011 06:01, Robert Bradshaw <rober...@math.washington.edu> wrote:
On Fri, Oct 14, 2011 at 1:07 PM, mark florisson <markflorisso...@gmail.com> wrote:
On 14 October 2011 19:31, Robert Bradshaw <rober...@math.washington.edu> wrote:
On Wed, Oct 12, 2011 at 7:55 AM, mark florisson <markflorisso...@gmail.com> wrote:
I ultimately feel things like that are more important than 100% coverage of
the OpenMP standard. Of course, OpenMP is a lot lower-hanging fruit.

+1. Prange handles the (coarse-grained) SIMD case nicely, and a
task/futures model based on closures would, I think, flesh this out to
the next level of generality (and complexity).

Futures are definitely nice. I suppose I really like "inline
futures", i.e. OpenMP tasks. I realize that futures may look more
pythonic. However, as mentioned previously, I also see issues with
that. When you submit a task you expect a future object back, which
you might want to pass around. But we don't hold the GIL for that. I
personally feel that futures are something that should be done by a
library (such as concurrent.futures in Python 3.2), and inline tasks
by a language. It also means I have to write an entire function or
closure for perhaps only a few lines of code.
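For reference, the library-level pattern under discussion looks like this
with Python 3.2's concurrent.futures (the work function is just a stand-in):

from concurrent.futures import ThreadPoolExecutor

def work(x):
    # stand-in for the few lines of actual work
    return x * x

with ThreadPoolExecutor(max_workers=4) as executor:
    future = executor.submit(work, 3)  # submission hands back a future
    # the future is an object you can pass around and wait on later
    print(future.result())             # blocks until the task completes

Note how even trivial work has to be wrapped in a named function (or
closure) before it can be submitted.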

I might also want to submit other functions that are not closures, or
I might want to reuse my closures both for tasks and for something
else. So what if my tasks contain more parallel constructs? E.g. what
if I have a task closure that I return from my function, which itself
generates more tasks? Would you just execute them sequentially
outside of the parallel construct, or would you simply disallow that?
Also, do you restrict future "objects" to the parallel section only?

Another problem is that you can only wait on your direct child tasks.
So what if I get access to my parent's future object (assuming you
allow tasks to generate tasks), and then want the result of my parent?
Or what if I store these future objects in an array or list and access
them arbitrarily? You will only know at runtime which task to wait on,
and OpenMP only has a static, lexical taskwait.

I suppose my point is that without either a drastic rewrite (e.g., use
pthreads instead of OpenMP) or quite a few constraints, I am unsure
how futures would work here. Perhaps you guys have some concrete
syntax and semantics proposals?

It feels to me that OpenMP tasks took a different model of parallelism
and tried to force it into the OpenMP model/constraints, and so it'd
be even more difficult to fit them into a nice pythonic interface.
Perhaps to make progress on this front we need to have a concrete
example to look at. I'm also wondering if the standard threading
module (perhaps with overlay support) used with nogil functions would
be sufficient--locking is required for handling the queues, etc. so
the fact that the GIL is involved is not a big deal. It is possible
that this won't scale to as small of work units, but the overhead
should be minimal once your work unit is a sufficient size (which is
probably quite small) and it's already implemented and well
documented/used.
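A minimal sketch of that work-queue approach with the standard library
(process here is a placeholder for the actual work unit):

import threading
import queue  # the module is named Queue on Python 2

def process(item):
    pass  # placeholder work unit

def worker(q):
    while True:
        item = q.get()
        if item is None:   # sentinel: no more work
            break
        process(item)
        q.task_done()

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()
for item in range(100):
    q.put(item)
for _ in threads:
    q.put(None)            # one sentinel per worker
for t in threads:
    t.join()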

It's all definitely possible with normal threads, but the thing you
lose is convenience and conciseness. For big problems the programmer
might summon up the courage and effort to implement it, but typically
you will just stick to a serial version. This is really where OpenMP
is powerful: you can take a simple sequential piece of code and make
it parallel with minimal effort and without having to restructure,
rethink and rewrite your algorithms.

That is a very good point.

Something like concurrent.futures is definitely nice, but most people
cannot afford to mandate Python 3.2 for their users.

The most classic examples I can think of for tasks are

1) independent code sections, i.e. two or more pieces of code that
don't depend on each other which you want to execute in parallel
2) traversal of some kind of custom data structure, like a tree or a linked list
3) some other kind of producer/consumer model

e.g. using with task syntax:

cdef postorder_traverse(tree *t): # bullet 1) and 2)
    with task:
        postorder_traverse(t.left)   # recurse into the subtrees
    with task:
        postorder_traverse(t.right)

    taskwait() # wait until we have traversed our subtrees
    use(t.data)

Is there an implicit parallel block here? Perhaps in the caller?

cdef list_traverse(linkedlist *L): # bullet 2)
    with nogil, parallel():
        if threadid() == 0:
            while L:
                with task:
                    do_something(L.data)
                L = L.next   # advance, otherwise we'd loop forever

In the latter case we don't need a taskwait as we don't care about any
particular order. Only one thread generates the tasks, while the others
just hit the barrier and pick up the tasks they can execute.
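For comparison, a futures-based spelling of the tree example above (a
pure-Python sketch; t is a hypothetical node object with .left/.right/.data
attributes, and use is a stand-in):

from concurrent.futures import ThreadPoolExecutor, wait

def use(data):
    print(data)  # stand-in for the real work

def postorder_traverse(executor, t):
    if t is None:
        return
    children = [executor.submit(postorder_traverse, executor, t.left),
                executor.submit(postorder_traverse, executor, t.right)]
    wait(children)   # plays the role of taskwait()
    use(t.data)

Caveat: blocking on child futures from inside a bounded pool can deadlock
once every worker is waiting; OpenMP avoids this by letting a thread
execute pending tasks at a taskwait point, which is part of the mismatch
discussed above.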

I guess the fact that Python doesn't have a nice syntax for anonymous
functions or blocks makes this syntax more appealing than an explicit
closure.

Perhaps we could come up with a more pythonic/natural name that would
make the intent clear. It makes me want to do something like

pool = ThreadPool(10)
for item in L:
    with pool:
        process(item)

but then you get into issues of passing the pool around. OpenMP has
the implicit pool of the nesting parallel block, so "with one thread"
or "with cython.parallel.pool" or something like that might be more
readable.
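For contrast, the closure-based spelling that works today with the
(undocumented but long-standing) stdlib ThreadPool:

from multiprocessing.pool import ThreadPool

def process(item):
    return item * 2   # placeholder work

pool = ThreadPool(10)
results = pool.map(process, range(100))  # blocks until all items are done
pool.close()
pool.join()

The proposed with-block would essentially inline the body of process at
the call site.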

The good thing is that the OpenMP runtime can decide at the task
generation point (not only at taskwait or barrier points!) to stop
generating more tasks and start executing them. So you won't exhaust
memory if you have lots of tasks.

Often threadpools have queues that block when their buffer gets full
to achieve the same goal.
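E.g. with the stdlib, a bounded queue gives the producer the same
back-pressure (sketch):

import queue

q = queue.Queue(maxsize=100)  # put() blocks once 100 items are pending

def producer(items):
    for item in items:
        q.put(item)  # blocks here whenever consumers fall behind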

As for critical and barrier, the notion of a critical block as a with
statement is very useful. Creating/naming locks (rather than being
implicit on the file/line number) is more powerful, but is a larger
burden on the user and more difficult to support with the OpenMP
backend.

Actually, as I mentioned before, critical sections do not depend on
their line or file number at all. All they depend on is their implicit
or explicit name (the name is implicit when you simply omit it, so all
unnamed critical sections exclude each other).

Ah, yes. In this case "with cython.parallel.lock([optional name])"
could be obvious enough.

We could also support atomic updates. We could rewrite
parallel.lock() blocks to atomics if all statements use inplace
operators, but that might actually not be safe, as the exclusion might
be needed for the rhs expressions. So I think you'd want a
parallel.atomic() directive or some such.
Alternatively, if you support parallel.shared(), you could specify
that inplace operators on any such variables would actually be atomic
updates, even if you use the operators on the elements of the shared
variable. e.g.

cdef int array1[N]
cdef int array2[N]
with parallel(), shared(array1):
    # atomic update
    array1[i] += ...

    # not an atomic update, as array2 is only "implicitly shared"
    array2[i] += ...

I'm not sure if that's more confusing than enlightening though.

Indeed, supporting creation of locks dynamically and allowing them to
be passed around arbitrarily would be hard (and likely not worth the
effort). Naming them is trivial though, which might not be incredibly
pythonic but is very convenient, easy and readable.

You can view this as a lookup by name, not a lock creation. Not
allowing them to be used outside of a with clause is a reasonable
restriction, and does not preclude a (possibly very distant) extension
to being able to pass them around.
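A lookup-by-name scheme is easy to sketch with stdlib primitives (the
registry and the named_lock helper here are hypothetical):

import threading

_locks = {}
_registry_guard = threading.Lock()

def named_lock(name="<unnamed>"):
    # first use of a name creates the lock; later uses return the same one
    with _registry_guard:
        return _locks.setdefault(name, threading.Lock())

# all critical sections sharing a name exclude each other
with named_lock("update-counters"):
    ...  # critical section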

barrier, if supported, should be a function call, not a context. It's
not as critical as in the tasks case, but a good example to see how it
flows would be useful here as well.

I agree, it really doesn't have any associated code and trying to
associate code with it is likely more confusing than meaningful. It
was just an idea.
Often you can rely on implicit barriers from e.g. prange, but not
always. I can't think of any real-world example, but you usually need
it to ensure that everyone gets a sane view of some shared data, e.g.

with nogil, parallel():
    array[threadid()] = func(threadid())
    barrier()
    # access the data of some neighbour
    use array[(threadid() + 1) % omp_num_threads()]

This is a rather contrived example, but (see below) it would be
especially useful if you use single/master/once/first to set some
shared data everyone will operate on (for instance in a prange). To
ensure the data is sane before you use it, you have to put in the
barrier to 1) ensure the data has been written and 2) ensure the data
has been flushed.

Basically, you'll always know when you need a barrier, but it's pretty
hard to come up with a real-world example for it when you have to :)

Yes, I think barriers are explanatory enough.

As for single, I see that doing this manually does require boilerplate
locking, so what about

if cython.parallel.once():  # will return True once per thread group
    ...

we could implement this via our own locking/checking/flushing to allow
it to occur in arbitrary expressions, e.g.

special_worker = cython.parallel.once()
if special_worker:
    ...
[common code]
if special_worker:   # single wouldn't work here
    ...
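One plausible implementation of such a once() in terms of ordinary
locking (a sketch only; a real version would also have to flush and
synchronize memory):

import threading

class Once:
    # returns True for exactly one caller, False for everyone else
    def __init__(self):
        self._done = False
        self._lock = threading.Lock()

    def __call__(self):
        with self._lock:
            if self._done:
                return False
            self._done = True
            return True

once = Once()
special_worker = once()   # True in exactly one thread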


That looks OK. I've actually been thinking that if we have barriers we
don't really need is_master(), once() or single() or anything. We
already have threadid(), and you usually don't care which thread gets
there first, you only care about doing it once. So one could just
write

if parallel.threadid() == 0:
    ...

parallel.barrier() # if required
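The same pattern is expressible today with the stdlib
(threading.Barrier exists as of Python 3.2); a sketch:

import threading

NTHREADS = 4
barrier = threading.Barrier(NTHREADS)
shared = {}

def run(tid):
    if tid == 0:
        shared["config"] = "set up exactly once"  # the threadid()==0 branch
    barrier.wait()  # nobody proceeds until the data is in place
    assert shared["config"] == "set up exactly once"

threads = [threading.Thread(target=run, args=(i,)) for i in range(NTHREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()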

Perhaps you want the first free thread to take it up, to minimize idle
threads. I agree that parallel.threadid() == 0 is a synonym for
is_master(), so it's probably not needed. However, what are the OpenMP
semantics of

cdef f():
    with parallel():
        g()
        g()

cdef g():
    with single():
        ... # executed once, right?
    with task:
        ... # executed twice, right?

It might also be convenient to declare variables explicitly shared
here, e.g. this code will not work:

cdef int *buf

with nogil, parallel.parallel():
    if parallel.threadid() == 0:
        buf = ...

    parallel.barrier()

    # will likely segfault: buf is private because we assigned to it,
    # so it's only valid in thread 0
    use buf[...]

So basically you'd have to do something like (&buf)[0][...], which
frankly looks pretty weird. However I do think such cases are rather
uncommon.

True. Perhaps this could be declared via "with nogil,
parallel.parallel(), parallel.shared(buf)" or something like that.

- Robert