On 19 October 2011 19:19, mark florisson <markflorisso...@gmail.com> wrote:
> On 19 October 2011 06:01, Robert Bradshaw <rober...@math.washington.edu> wrote:
>> On Fri, Oct 14, 2011 at 1:07 PM, mark florisson
>> <markflorisso...@gmail.com> wrote:
>>> On 14 October 2011 19:31, Robert Bradshaw <rober...@math.washington.edu> wrote:
>>>> On Wed, Oct 12, 2011 at 7:55 AM, mark florisson
>>>> <markflorisso...@gmail.com> wrote:
>>>>>>> I ultimately feel things like that are more important than 100%
>>>>>>> coverage of the OpenMP standard. Of course, OpenMP is a lot
>>>>>>> lower-hanging fruit.
>>>>>>
>>>>>> +1. Prange handles the (coarse-grained) SIMD case nicely, and a
>>>>>> task/futures model based on closures would, I think, flesh this
>>>>>> out to the next level of generality (and complexity).
>>>>>
>>>>> Futures are definitely nice. I think I really like "inline
>>>>> futures", i.e. OpenMP tasks. I realize that futures may look more
>>>>> pythonic. However, as mentioned previously, I also see issues with
>>>>> that. When you submit a task, you expect a future object back,
>>>>> which you might want to pass around. But we don't have the GIL for
>>>>> that. I personally feel that futures are something that should be
>>>>> done by a library (such as concurrent.futures in Python 3.2), and
>>>>> inline tasks by a language. Futures also mean I have to write an
>>>>> entire function or closure for perhaps only a few lines of code.
>>>>>
>>>>> I might also want to submit other functions that are not closures,
>>>>> or I might want to reuse the closures I use for tasks for
>>>>> something else as well. And what if my tasks contain more parallel
>>>>> constructs? E.g. what if I have a task closure that I return from
>>>>> my function, and that generates more tasks itself? Would you just
>>>>> execute them sequentially outside of the parallel construct, or
>>>>> would you simply disallow that? Also, do you restrict future
>>>>> "objects" to only the parallel section?
>>>>>
>>>>> Another problem is that you can only wait on tasks of your direct
>>>>> children. So what if I get access to my parent's future object
>>>>> (assuming you allow tasks to generate tasks) and then want the
>>>>> result of my parent? Or what if I store these future objects in an
>>>>> array or list and access them arbitrarily? You will only know at
>>>>> runtime which task to wait on, and OpenMP only has a static,
>>>>> lexical taskwait.
>>>>>
>>>>> I suppose my point is that without either a drastic rewrite (e.g.,
>>>>> using pthreads instead of OpenMP) or quite a few constraints, I am
>>>>> unsure how futures would work here. Perhaps you guys have some
>>>>> concrete syntax and semantics proposals?
>>>>
>>>> It feels to me that OpenMP tasks took a different model of
>>>> parallelism and tried to force it into the OpenMP
>>>> model/constraints, so it'd be even more difficult to fit them into
>>>> a nice pythonic interface. Perhaps to make progress on this front
>>>> we need a concrete example to look at. I'm also wondering if the
>>>> standard threading module (perhaps with overlay support) used with
>>>> nogil functions would be sufficient -- locking is required for
>>>> handling the queues, etc., so the fact that the GIL is involved is
>>>> not a big deal. It is possible that this won't scale down to very
>>>> small work units, but the overhead should be minimal once your work
>>>> unit is a sufficient size (which is probably quite small), and it's
>>>> already implemented and well documented/used.
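For concreteness, a minimal sketch of that threading-plus-nogil
approach (every name here is made up for illustration; process_item
stands in for whatever nogil work unit you have):

    from threading import Thread
    from Queue import Queue  # "queue" on Python 3

    cdef void process_item(long item) nogil:
        pass  # the actual work unit, runs without the GIL

    def worker(q):
        cdef long v
        while True:
            item = q.get()    # the queue does its own locking
            if item is None:  # sentinel: no more work
                return
            v = item
            with nogil:
                process_item(v)

    def run(items, int nthreads=8):
        q = Queue()
        threads = [Thread(target=worker, args=(q,))
                   for _ in range(nthreads)]
        for t in threads:
            t.start()
        for item in items:
            q.put(item)
        for t in threads:
            q.put(None)       # one sentinel per worker
        for t in threads:
            t.join()

The GIL is reacquired between work items to talk to the queue, but
that locking would be needed anyway, so it only matters for very
small work units.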
>>>
>>> It's all definitely possible with normal threads, but the thing you
>>> lose is convenience and conciseness. For big problems the programmer
>>> might summon up the courage and effort to implement it, but
>>> typically you will just stick to a serial version. This is really
>>> where OpenMP is powerful: you can take a simple sequential piece of
>>> code and make it parallel with minimal effort and without having to
>>> restructure, rethink and rewrite your algorithms.
>>
>> That is a very good point.
>>
>>> Something like concurrent.futures is definitely nice, but most
>>> people cannot afford to mandate Python 3.2 for their users.
>>>
>>> The most classical examples I can think of for tasks are
>>>
>>> 1) independent code sections, i.e. two or more pieces of code that
>>>    don't depend on each other and that you want to execute in
>>>    parallel
>>> 2) traversal of some kind of custom data structure, like a tree or
>>>    a linked list
>>> 3) some other kind of producer/consumer model
>>>
>>> e.g. using the with task syntax:
>>>
>>> cdef postorder_traverse(tree *t):  # bullets 1) and 2)
>>>     with task:
>>>         postorder_traverse(t.left)
>>>     with task:
>>>         postorder_traverse(t.right)
>>>
>>>     taskwait()  # wait until we have traversed our subtrees
>>>     use(t.data)
>>
>> Is there an implicit parallel block here? Perhaps in the caller?
>
> Yes, it was implicit in my example. If you'd use that code, you'd
> call it from a parallel section. Depending on what semantics you'd
> define (see below), you'd call it either from one thread in the team,
> or with all of them.
>
>>> cdef list_traverse(linkedlist *L):  # bullet 2)
>>>     with nogil, parallel():
>>>         if threadid() == 0:
>>>             while L != NULL:
>>>                 with task:
>>>                     do_something(L.data)
>>>                 L = L.next
>>>
>>> In the latter case we don't need a taskwait, as we don't care about
>>> any particular order. Only one thread generates the tasks; the
>>> others just hit the barrier and see the tasks they can execute.
>>
>> I guess it's the fact that Python doesn't have a nice syntax for
>> anonymous functions or blocks that makes this syntax more appealing
>> than an explicit closure.
>>
>> Perhaps we could come up with a more pythonic/natural name that
>> would make the intent clear. Makes me want to do something like
>>
>> pool = ThreadPool(10)
>> for item in L:
>>     with pool:
>>         process(item)
>>
>> but then you get into issues of passing the pool around. OpenMP has
>> the implicit pool of the nesting parallel block, so "with one
>> thread" or "with cython.parallel.pool" or something like that might
>> be more readable.
>
> I think "with pool" would be good; it must be clear that the task is
> submitted to a thread pool and hence may be executed asynchronously.
>
>>> The good thing is that the OpenMP runtime can decide at the task
>>> generation point (not only at taskwait or barrier points!) to stop
>>> generating more tasks and start executing them. So you won't
>>> exhaust memory if you have lots of tasks.
>>
>> Often thread pools have queues that block when their buffer gets
>> full to achieve the same goal.
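For comparison, with a thread pool that cutoff falls naturally out of
a bounded queue; a minimal sketch (hypothetical, in the spirit of the
worker/run sketch earlier):

    from Queue import Queue  # "queue" on Python 3

    q = Queue(maxsize=128)   # put() blocks once 128 items are pending

    def produce(q, items):
        for item in items:
            q.put(item)  # the producer throttles itself here instead
                         # of exhausting memory when consumers lag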
>>>> As for critical and barrier, the notion of a critical block as a
>>>> with statement is very useful. Creating/naming locks (rather than
>>>> being implicit on the file/line number) is more powerful, but is a
>>>> larger burden on the user and more difficult to support with the
>>>> OpenMP backend.
>>>
>>> Actually, as I mentioned before, critical sections do not depend on
>>> their line or file number at all. All they depend on is their
>>> implicit or explicit name (the name is implicit when you simply
>>> omit it, so all unnamed critical sections exclude each other).
>>
>> Ah, yes. In this case "with cython.parallel.lock([optional name])"
>> could be obvious enough.
>>
>>> Indeed, supporting dynamic creation of locks and allowing them to
>>> be passed around arbitrarily would be hard (and likely not worth
>>> the effort). Naming them is trivial though, which might not be
>>> incredibly pythonic but is very convenient, easy and readable.
>>
>> You can view this as a lookup by name, not a lock creation. Not
>> allowing them to be used outside of a with clause is a reasonable
>> restriction, and does not preclude a (possibly very distant)
>> extension to being able to pass them around.
>>
>>>> barrier, if supported, should be a function call, not a context.
>>>> Not as critical as in the tasks case, but a good example to see
>>>> how it flows would be useful here as well.
>>>
>>> I agree, it really doesn't have any associated code, and trying to
>>> associate code with it is likely more confusing than meaningful. It
>>> was just an idea.
>>> Often you can rely on implicit barriers from e.g. prange, but not
>>> always. I can't think of any real-world example, but you usually
>>> need it to ensure that everyone gets a sane view of some shared
>>> data, e.g.
>>>
>>> with nogil, parallel():
>>>     array[threadid()] = func(threadid())
>>>     barrier()
>>>     # access the data of some neighbour
>>>     use(array[(threadid() + 1) % omp_num_threads()])
>>>
>>> This is a rather contrived example, but (see below) it would be
>>> especially useful if you use single/master/once/first to set some
>>> shared data everyone will operate on (for instance in a prange). To
>>> ensure the data is sane before you use it, you need the barrier to
>>> 1) ensure the data has been written and 2) ensure the data has been
>>> flushed.
>>>
>>> Basically, you'll always know when you need a barrier, but it's
>>> pretty hard to come up with a real-world example when you have
>>> to :)
>>
>> Yes, I think barriers are explanatory enough.
>>
>>>> As for single, I see doing this manually does require boilerplate
>>>> locking, so what about
>>>>
>>>> if cython.parallel.once():  # returns True once per thread group
>>>>     ...
>>>>
>>>> We could implement this via our own locking/checking/flushing to
>>>> allow it to occur in arbitrary expressions, e.g.
>>>>
>>>> special_worker = cython.parallel.once()
>>>> if special_worker:
>>>>     ...
>>>> [common code]
>>>> if special_worker:  # single wouldn't work here
>>>>     ...
>>>
>>> That looks OK. I've actually been thinking that if we have barriers
>>> we don't really need is_master(), once() or single() at all. We
>>> already have threadid(), and you usually don't care which thread
>>> gets there first, you only care about doing it once. So one could
>>> just write
>>>
>>> if parallel.threadid() == 0:
>>>     ...
>>>
>>> parallel.barrier()  # if required
>>
>> Perhaps you want the first free thread to take it up, to minimize
>> idle threads. I agree "if parallel.threadid() == 0" is a synonym for
>> is_master(), so it's probably not needed. However, what are the
>> OpenMP semantics of
>>
>> cdef f():
>>     with parallel():
>>         g()
>>         g()
>>
>> cdef g():
>>     with single():
>>         ...  # executed once, right?
>>     with task:
>>         ...  # executed twice, right?
>
> Hmm, not quite. The thing is that function g is called by every
> thread in the team, say N threads, and each time the team encounters
> the single directive it executes the block once. So in total the code
> in the single block runs twice, as the team encounters it twice.
>
> It will, however, create 2N tasks, as every thread that encounters
> the task directive creates one. This is probably not what you want,
> so you usually want
>
> with parallel():
>     if threadid() == 0:
>         g()
>
> and have the code in g (now executed by one thread only) create the
> tasks.
>
> Note also that 'for _ in prange(1):' would not have the same
> semantics here, as it generates a 'parallel for' and not a
> worksharing for in the function (because we don't support orphaned
> pranges).
>
> I think this may all be confusing for users. Usually you will want to
> create just a single task, irrespective of whether you are in a
> parallel() or a prange, and not "however many threads are in the team
> for parallel(), but just one for prange because we're sharing work".
> This would also work for orphaned tasks, e.g. you expect 2 tasks in
> your snippet above, not 2N. Fortunately, that would be easy to
> support. We would however have to introduce the same restriction as
> with (implicit) barriers: either all or none of the threads must
> encounter the construct (or maybe loosen it to "if you actually want
> to create the task, make sure at least thread 0 encounters it", which
> may lead users to write more efficient code).
>
>>> It might also be convenient to declare variables explicitly shared
>>> here, e.g. this code will not work:
>>>
>>> cdef int *buf
>>>
>>> with nogil, parallel.parallel():
>>>     if parallel.threadid() == 0:
>>>         buf = ...
>>>
>>>     parallel.barrier()
>>>
>>>     # this will likely segfault, as buf is private because we
>>>     # assigned to it; it's only valid in thread 0
>>>     use(buf[...])
>>>
>>> So basically you'd have to do something like (&buf)[0][...], which
>>> frankly looks pretty weird. However, I do think such cases are
>>> rather uncommon.
>>
>> True. Perhaps this could be declared via "with nogil,
>> parallel.parallel(), parallel.shared(buf)" or something like that.
>
> That looks elegant enough.
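For illustration, the buf example above might then read something
like this (entirely hypothetical: shared(), barrier() and threadid()
are the constructs proposed in this thread, and n is some assumed
element count):

    from libc.stdlib cimport malloc, free

    cdef int *buf

    with nogil, parallel.parallel(), parallel.shared(buf):
        if parallel.threadid() == 0:
            buf = <int *> malloc(n * sizeof(int))
        parallel.barrier()  # 1) buf has been written, 2) it is flushed
        buf[parallel.threadid()] = parallel.threadid()
        parallel.barrier()  # all writes done before thread 0 frees
        if parallel.threadid() == 0:
            free(buf)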
Likewise, I think something like parallel.private(buf) would also be
really nice for arrays, especially if we also allow arrays with
runtime sizes (behind the scenes we could malloc and free). I think
those cases are much more common than parallel.shared().

>> - Robert

_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel