Recently I thought it would be a good idea to try out the new concurrency system once again. Some time back, when 'shared' was still new, I tried it several times, but since it was practically unusable then, I gave up on it (and, as it seems, so did many others).

Now, however, after TDPL has been released and there is some documentation plus std.concurrency, the system should be in a state where it is actually useful, with only bugs left to fix - nothing that requires inherent changes to the system. The reality is quite different as soon as you step anywhere off the already-walked path (defined by the book examples and similar things).

Just for the record, I have done a lot with most kinds of threading schemes (even if the only lock-free thing I implemented was a simple SharedPtr/WeakPtr implementation *shiver*). This may very well mean that some patterns are burned into my head that clash with some of the ideas behind the current system. But for most of the points I am quite sure that there is no viable alternative if performance and memory consumption are to be anywhere near the optimum.

I apologize for the length of this post, even though I have already tried to make it as short as possible and left out a lot of details. It is also quite possible that I assume some false things about the concurrency implementation, because my knowledge is based mostly on the NG and the book chapter.

The following problems are those I found during a one-day endeavor to convert some parts of my code base to spawn/shared (not really successfully, partly because of the very viral nature of shared).


1. spawn and objects

spawn() only supports a 'function' plus some bound parameters. Since taking the address of an object method in D always yields a delegate, it is not possible to call class members without a static wrapper function. This can be quite disturbing when working object-oriented (C++ obviously has the same problem).
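To illustrate, here is a minimal sketch of the wrapper pattern that is currently required - Worker and runWrapper are just made-up example names:

---
	import std.concurrency;

	synchronized class Worker {
		void run() { /* the actual work */ }
	}

	// &Worker.run yields a delegate, which spawn() rejects, so a free
	// function has to forward the call explicitly:
	void runWrapper(shared(Worker) w) { w.run(); }

	void startWorker(shared(Worker) w) {
		spawn(&runWrapper, w); // function pointer + bound parameter
	}
---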


2. error messages

Right now, error messages just state that there is a shared/unshared mismatch somewhere. For a non-shared-expert this can be a real bummer. You have to know a lot about the implications of 'shared' to correctly interpret these messages and track down the cause. Not very good for a feature that is meant to make threading easier.


3. everything is implicit

This may seem kind of counter-intuitive, but using 'synchronized' classes and features like setSameMutex - which are absolutely necessary; it would be foolish to neglect the importance of lock-based threading in an object-oriented environment - creates a feeling of climbing without a safety rope. Not stating how you really want to synchronize/lock, and not being able to read directly from the code how this is actually done, leaves a black-box feeling. This in turn means threading newcomers will not be educated; they just use the system somehow and it magically works. But as soon as problems such as deadlocks appear, you suddenly have to understand the details, and at that moment you have to read up on and remember everything that is going on in the background - plus everything you would have to know about threading/synchronization in C. I'm not sure whether this is the right course here or whether there is a better one.
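A small made-up example of what I mean by the black-box feeling:

---
	synchronized class Account {
		private double amount = 0;

		// the lock/unlock of the hidden object monitor is completely
		// implicit here - nothing in the source shows what is locked,
		// when, or how it interacts with other mutexes
		void deposit(double v) { amount = amount + v; }
	}
---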


4. steep learning curve - more of a high learning wall to climb

As a result of the first points, my feeling is that a newcomer who has not followed the discussions and thoughts about the system here will find himself standing before a very high barrier of material to learn before he can actually put any of it to use. I also imagine this to be a very painful process, because of all the things you discover are not possible and the error messages that potentially make you bang your head against the wall.

        
5. advanced synchronization primitives need to be considered

Primitives such as core.sync.condition (the most important one) need to be taken into account by the 'shared' system. This means there needs to be a condition variable that takes a shared object instead of a mutex, or you have to be able to query an object's mutex.
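For illustration, this is how it looks today in the non-shared world, plus (in the comment) the kind of API that would be needed - the latter is purely hypothetical:

---
	import core.sync.condition;
	import core.sync.mutex;

	void example()
	{
		// today: a condition variable is tied to an explicit mutex
		auto m = new Mutex;
		auto c = new Condition(m);

		// what the shared world would need (hypothetical API):
		// a Condition constructed from a shared object, or a way to
		// query the hidden monitor of a synchronized object, e.g.
		//   auto c2 = new Condition(mySharedObject.mutex);
	}
---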

        
6. temporary unlock

There are often situations in lock-based programming in which you need to temporarily unlock your mutex, perform some time-consuming external task (disk i/o, ...) and then reacquire the mutex. This feature needs language support - which is important not least because it is really difficult and dirty to work around - and it could be something like the inverse of a synchronized {} block, or the possibility to define a special kind of private member function that unlocks the mutex. Inside such blocks the compiler of course has to make sure that the appropriate access rules are not broken (this could be as conservative as disallowing access to any class member).
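A sketch of what I mean - the 'unsynchronized' syntax is of course made up and therefore kept inside a comment:

---
	synchronized class Loader {
		void load(string path)
		{
			// ... inspect internal state under the implicit lock ...

			// hypothetical inverse of a synchronized {} block; the
			// mutex would be released for the duration of the block
			// and the compiler would forbid member access inside it:
			//
			//     unsynchronized {
			//         performSlowDiskIO(path); // long external task
			//     }
			//
			// afterwards the mutex would be re-acquired automatically
		}
	}
---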

        
7. optimization of pseudo-shared objects

Since the shareability of an object ('synchronized') is already decided at class definition time, for performance reasons it should be possible to somehow disable the mutex for those instances that are only used thread-locally. Maybe it should be necessary to declare objects as "shared C c;" even if the class is defined as "synchronized class C {}", and otherwise you would get an object without a mutex that is not shared?
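In code, the scheme I have in mind would look something like this (the semantics in the comments are hypothetical, not how it works today):

---
	synchronized class C { void foo() {} }

	void fnc()
	{
		shared C c1 = new shared(C); // shared instance: carries a
		                             // mutex, foo() locks it as usual
		C c2 = new C;                // hypothetical: thread-local
		                             // instance, mutex and the locking
		                             // in foo() elided
	}
---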

        
8. very strong split of shared and non-shared worlds

For container classes in particular it is really nasty that you have to define two versions of the container - one shared and one non-shared - if you want to be able to use it in both contexts and to put non-shared objects into it in a non-shared context. There should also really be a way to declare a class to be hygienic, in a way similar to pure, so that it could be used in a synchronized context and store shared objects even though it is not shared itself.
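A sketch of the duplication this causes (Queue is a made-up example):

---
	// the thread-local version:
	class Queue(T) {
		private T[] items;
		void put(T item) { items ~= item; }
	}

	// and a second, shared/synchronized variant with essentially the
	// same body would have to be written and maintained in parallel:
	// synchronized class SharedQueue(T) { ... }
---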

        
9. unique

Unique objects or chunks of data are really important, not only to be able to verify that a cast to 'immutable' is correct, but also to allow passing objects to another thread for computation without making a superfluous copy or doing superfluous work.
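A small example of the kind of cast that is safe in practice but unverifiable for the compiler (publish and the buffer setup are made up):

---
	import std.concurrency;

	void publish(Tid consumer)
	{
		int[] buf = new int[](1024);
		buf[] = 42; // fill the freshly allocated buffer

		// 'buf' is in fact unique here, so this cast is safe - but
		// the compiler cannot verify that, and the only checkable
		// alternative is a superfluous copy:
		immutable data = cast(immutable(int)[])buf;
		send(consumer, data);
	}
---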

        
10. recursive locking

The only option right now is to have mutexes behave recursively. This makes it easy to avoid deadlocks within the same thread. However, in my experience recursive mutexes are very dangerous, because typically no one takes into account what happens when an algorithm is re-entered from the middle of its computation. This can happen easily in a threaded environment where you often use signals/slots or message passing. In 90% of such situations a deadlock, or at least an assertion in debug mode, is a good indicator that something just happened that should not have. Objects with shared mutexes are of course a different matter - in that case you actually need an ownership relation to do anything useful with non-recursive mutexes.
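A sketch of the kind of re-entrancy I mean (Model and notifyObservers are made-up names):

---
	synchronized class Model {
		private int[] data;

		void update()
		{
			// ... state is temporarily inconsistent here ...
			notifyObservers(); // if a slot calls back into update(),
			                   // a recursive mutex silently allows
			                   // re-entry on the half-updated state
			                   // instead of deadlocking/asserting
			// ... state becomes consistent again ...
		}

		private void notifyObservers() { /* fire signals/slots */ }
	}
---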

        
11. holes in the system

It seems like there are still a lot of ways to slip non-shared data into a shared context.

        One example is that you can pass a shared array to a function taking an unshared one - the following code
        ---
                import std.concurrency;

                void fnc(int[] arr) {}

                void fnc2()
                {
                        shared int[] arr;
                        spawn(&fnc, arr); // hands shared data to an
                                          // unshared parameter
                }
        ---
        
        compiles. This is probably just a bug and easy to fix, but what about:
        
        ---
                import std.concurrency;

                class C {
                        private void method() {}
                        private void method2()
                        {
                                // passes the unshared 'this' reference
                                // to another thread:
                                spawn(function void(C inst){ inst.method(); },
                                        this);
                        }
                }
        ---
        
Unless private functions take part in the recursive locking (which in turn is usually useless overhead), method() will be invoked in a completely unprotected context. This one has to be fixed somehow in the language. I'm sure there are other things like these.

        
12. more practical examples need to be considered

It seems that right now all the examples used to explore the features needed in the system are of a rather academic nature: either the most simple i/o, or pure functional computation, maybe a network protocol. However, when it comes to practical high-performance computation on real systems, where memory consumption and low-level performance really matter, there seems to be quite some no-man's-land.
        
        Here are some simple examples where I immediately came to a grinding halt:
        
        I. An object loader with background processing
        
You have a shared class Loader which uses multiple threads to load objects on demand and then fires a signal or returns from its loadObject(x) method.
                
The problem is that the actual loading of an object must happen outside of a synchronized region of the loader, or you get no parallelism out of this. Also, because of 'spawn', you have to use an external function instead of being able to use a member function directly. Fortunately, in this case that is also the workaround: define an external function that takes the arguments needed to load the object, loads it, and then passes the result back to the class (sketched below). Waiting for finished objects can be implemented with message passing without worry here, because the MP overhead is probably low enough.
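A sketch of this workaround - loadWorker and the exact message layout are just one possible way to do it:

---
	import std.concurrency;
	import std.file : read;

	// the worker is a free function because spawn cannot take a method
	void loadWorker(Tid owner, string path)
	{
		// slow disk i/o happens outside of any lock of the Loader
		auto data = cast(immutable(ubyte)[])read(path);
		send(owner, path, data); // hand the finished object back
	}

	synchronized class Loader {
		void loadObject(string path)
		{
			spawn(&loadWorker, thisTid, path);
		}
	}
---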
                
                Features missing:
                        - spawn with methods
                        - temporary unlock
                        
        II. Implementation of a ThreadPool
        
The majority of applications can very well be broken up into small chunks of work that can be processed in parallel. Instead of using a costly thread-create, run-task, thread-destroy cycle, it would be wise to reuse the threads for later tasks. The implementation of a thread pool that does this is of course a low-level thing, and you could argue that it is OK to use some casts and such here. Still, quite a few things are missing; a sketch of the classic mutex/condition core of such a pool follows after the list.
                
                Features missing:
                        - spawn with methods
                        - temporary unlock
                        - condition variables (message passing too slow +
                          you need to manage destinations)
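For reference, a sketch of that classic mutex + condition core (ThreadPool is my own minimal made-up version, not an existing API) - note that it bypasses shared/synchronized entirely, which is part of the problem:

---
	import core.sync.condition;
	import core.sync.mutex;
	import core.thread;

	class ThreadPool {
		private Mutex m;
		private Condition c;
		private void delegate()[] tasks;
		private bool done;

		this(size_t threadCount)
		{
			m = new Mutex;
			c = new Condition(m);
			foreach (i; 0 .. threadCount) {
				auto t = new Thread(&workerLoop);
				t.start();
			}
		}

		void put(void delegate() task)
		{
			synchronized (m) {
				tasks ~= task;
				c.notify(); // wake one waiting worker
			}
		}

		void finish()
		{
			synchronized (m) {
				done = true;
				c.notifyAll(); // wake everyone so they can exit
			}
		}

		private void workerLoop()
		{
			while (true) {
				void delegate() task;
				synchronized (m) {
					while (tasks.length == 0 && !done)
						c.wait(); // releases m while waiting
					if (done && tasks.length == 0) return;
					task = tasks[0];
					tasks = tasks[1 .. $];
				}
				task(); // run the task outside of the lock
			}
		}
	}
---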

        III. multiple threads computing separate parts of an array
                
Probably the simplest form of parallelism is to perform similar operations on each element of an array (or similar operations on regions of it) in separate threads. The good news is that this works in the current implementation. The bad news is that it is really slow, because you have to use atomic operations on the elements - otherwise it is unsafe and prone to low-level races. Right now the compiler checks almost nothing.
                The alternative would be to pass unique, non-overlapping slices of the array to the threads, but there is currently no way to express that.
                
                To illustrate the current situation, this compiles and runs:

                ---
                        import std.concurrency;
                        import std.stdio;

                        void doCompute(size_t offset, int[] arr) // arr should be shared
                        {
                                foreach (i, ref el; arr) {
                                        // should be an atomic operation, which would make
                                        // this useless because of the performance penalty
                                        el *= 2;
                                        writefln("Thread %s computed element %d: %d",
                                                thisTid(), i + offset, cast(int)el);
                                }
                        }

                        void waitForThread(Tid thread)
                        {
                                // TODO: implement in some complex way using messages,
                                // or maybe there is a simple function for this
                        }

                        void main()
                        {
                                shared int[] myarray = [1, 2, 3, 4];
                                Tid[2] threads;
                                foreach (i, ref t; threads)
                                        // should error out because the slice is not shared
                                        t = spawn(&doCompute, i, myarray[i .. i+3]);
                                foreach (t; threads)
                                        waitForThread(t);
                        }
                ---
                
                Features missing:
                        - unique
- some way to safely partition/slice an array and get a set of still unique slices


- Sönke
