On Sun, 4 Sep 2016 00:15:15 +0900 Carsten Haitzler (The Rasterman) <ras...@rasterman.com> said:
THIS TIME it's right: https://phab.enlightenment.org/D4281 it's a start. a good one. > Why? This directly impacts design and limitations of EO. It also impacts > performance. It also impacts safety. This is rather important to settle for > 1.19 as this lays the base for the future of EFL. > > So this past week I was trying to improve EOID lookup and found the EOID > tables and caches were not locked so if you accessed any object or created or > deleted objects across 2 or more threads you had race conditions waiting to > happen. No locks! So I added a spinlock. In the tests there should be zero > contention so the lock doesn't spin. > > Result? Cost of EOID lookup (to find the real pointer of the object) went from > 2% to 5% of CPU time. That's how much thread safety costs. It's not cheap. > This got me thinking. "There has to be another way...". > > So first. What is EOID? This is a 32 or 64bit VALUE we place in a pointer (yes > I know technically not portable bizarre architectures - We had to do this to > keep ABI in EFL). This ID gives us an INDIRECT lookup to an object pointer. > Imagine the simplest version in your head is > > Object *obj = all_eo_objects[eoid]; > > So we take the value and check a table. The object pointer is then either a > valid ptr (not NULL) and the object is safe to use, or it's NULL (because eo > maintains the table and ensures any deleted object has it's array member set > to NULL). It is a little more complex than that. It's fragmented into several > levels of smaller tables for memory efficiency. Also there's generation counts > too. In fact check the attached "eoid.png" in this mail for a little bit of a > better explanation (there are in fact 2 more fields that are used for ref and > super call detection. ignore them as they are not relevant). > > On 64bit we have a lot of bits to play with. On 32 it's limited. We use 2 bits > for ref and super, so 30 left. Currently 8 bits for generation count, 5 bits > for the top table, 5 for the middle table, and last table has 12 bits (so 32 > entries, 32 more than 4096 in the leaf table). Generation count goes up to 255 > before looping to 0 as we always do generation++ every alloc. We might need to > steal 2 more bits, so i was thinking take one from generation count and one > from the last table entry value. a bit less safe and lower "max # of objects" > so on 32bit max 3 objects goes down to 2 million and Generation count to 0 or > 127. Safety of that part of the check is halved. :( > > So every time you do any call in efl like: > > efl_gfx_show(obj); > > We HAVE to resolve an EOID to a pointer. This was 2% of CPU time in the > genlist autobounce test, but now it's 5% thanks to needing to be thread safe. > This isn't great. We break up the EOID into its various fields, figure out > what table to access and what row, check for non-NULL pointers, then compare > generation count so it matches for added safety in case a slot is re-cycled > (the EOID without generation count is the same between a deleted object and a > newly created one and we accidentally use the old EOID). Every allocation of > an eo object makes generation count go up by 1 and we just use the lowest N > bits as the count, so if the slot is recycled the objects also have to be > allocated in the same generation (unlikely). On 64bit the above numbers are > pretty darned big, so there isn't an issue of space. Just on 32bit. > > So the real issue is this lock. Every time and object is looked up, created or > destroyed we need to lock and unlock all of this EOID table space and related > caches and generation count etc. It is just a single global lock but it's not > cheap since we do this SO OFTEN (yes it does have a cache to store the last > looked-up object, but this is global). > > What can we do? Well remove the lock? Make it cheaper? Well we can't make the > lock cheaper. I've looked. We MIGHT be able to make it smaller (not relevant > as there is a single global one), but not cheaper in cycles. More fine grained > locks won't help because we aren't talking contention here as the issue. > Literally a SUCCESSFUL lock take and then release with no contention takes 3% > of our CPU time. > > So here is an idea. I checked. A TLS lookup for any var is 1/5th the cost or > so of a lock+unlock. We could use __thread and this is FREE (no cost - well > it's a mov only), but this is only free in binaries not shared libraries, so > let's talk TLS which is much cheaper than a lock anyway. Why TLS? > > 1. It will drop out cost above a lot. > 2. We can remove several other locks in EO too cutting some more costs. > 3. We can REALLY CHEAPLY enforce the rule of "you may not access an object > outside its owning thread". Because every thread has it's own EOID table, the > EOID will be local to that thread only. Looking it up in another thread is > looking up an "invalid" EOID. In fact we can make this pretty much always fail > by using some domain bits in EOID like i mentioned above (steal some from > table entries and/or generation count). Now literally stuff will FAIL and not > magically work 99% of the time if you disobey these rules and access a > rectangle object or a timer from another thread that doesn't own them as the > objects literally are not in the local thread EOID table. The tread cannot see > them (well is very unlikely to see them). > > So how to do this? > > 0. Move the EOID table, and associated data into a TLS structure > 1. Every thread HAS to do an eo_thread_init() (eo_init can do this implicitly > for the mainloop thread, and we can make ecore_thread/later Efl.Thread do this > per thread it creates for you so you don't have to). > 2. This thread init sets up an initial generation count at a random value so > generation counts can't easily be in sync and maybe swizzles the order EOID's > are found/allocated and so on to minimize ID "sameness". > 3. We take 2 more bits as a thread domain for the EOID. Mainloop thread has > domain 0. All other threads have domain 1 UNLESS they explicitly ask for a > "alternate" domain and must do so before any eo objects are created and we > have domain 3 and 4 for this for "special" domains. > 4. We make it a clear rule that threads cannot access objects outside of those > that they created UNLESS: > 4.1 An object is explicitly SENT from one thread to another (we can do this > later but if this is done, the object must have a refcount of 1 only, no > parent, no children, no objects referenced in keys, weak refts to/from this > object etc.). We can release the EOID entry in thread 1, but not call > destructor and free object memory, send the POINTER to thread 2, and here a > new EOID local to that thread is allocated and that pointer adopted. > 4.2 A whole thread is blocked and its table adopted as a secondary table. This > is needed for ecore_main_loop_begin/end(). This is safe because the OWNING > thread is paused during this time. To do this the 2 threads must NOT SHARE the > same domain. So a regular thread (domain 1) can always do a begin/end on the > mainloop as they have differing domains. You have 2 more domains to use as > object brokers that are less likely to have latency like the mainloop. > 5. As long as we get the ptr of the TLS in the owning thread it will work when > sent to another thread, so the above sharing of 2 thread tables temporarily > can be done. > 7. It's an EO (and EFL)-wide rule that you should not make threadsafe objects > because EO just won't support it - you have to explicitly send objects around > or do a begin/end of another thread to look at it's objects (and then that is > limited to a thread of a different domain - we have 4 so not bad). > 8. We need to make sure eo doesn't allow you to do things like set a child of > an object that is in a different domain to the parent, or with weak refs, key > object refs etc. > 9. When we do object sending we need maybe pipes with fd's with functions to > process "incoming objects" and have some kind of eo object with callbacks to > tell you when it receives a new incoming object so when sent from one thread, > the thread releases it's EOID but the whole data is floating around UNTIL it > arrives at the other end and someone needs to know WHEN it arrives and what > the new EOID is. So we need some kind of per-thread/loop "object receiver" > and have to glue this into Efl.Loop if there is one in that thread. > > I have most of the nasty bits sorted out ... all but: > > When you are in a begin/end section and you see 2 EOID tables, when you CREATE > a new object... which one does it go into? Remember that when you CALL a > method on an obj it may go create objects internally too. How can you > determine which to use? You should be able to access both without creating > without issue with domains as above. You could delete fine since an object > knows which table it belongs to in the current thread context based on domain > number. They will be different. You can't bind a foreign domain in if it > matches yours - it'll fail. But creation is special. > > One option... if you create WITH a parent passed, the child must go into the > same domain automatically. Operations mixing domains in an object tree should > fail. What about other cases? Create a bare object with no parent... you add > as a child later. How to choose which domain it goes into? Local or fireign? > Maybe there is a context you can switch that is in your TLS that tells you > which to use (local or foreign table). If we have a push/pop setup it'd be > nice, but it's easy to get wrong. An explicit call to crate with foreign and > eo_add is local? So eo_foregin_add() uses the foreign domain (if adopted at > the time, and if not it will either fail or just use local domain then). Worth > thinking about. > > This buys us REALLY NICE "thread safety" in the way that objects are just not > allowed to span threads. They must be explicitly sent over and thus ownership > (and EOID value) changes, or you must explicitly do a begin/end on another > thread and adopt it's ID table into your local space as a "foregin" table that > allows you definitely "read only" access easily, even the ability to modify > and delete, but just creation is tricky. This really will clear up lots of > mistakes we have been seeing from code that uses threads and does "bad > things" that happen to work 99% of the time then fail oddly 1% of the time. > We don't have to write "is this my thread" checking code in every method > because the design will do that mostly for us as a side-effect of TLS and > normal EOID checking. This also buys us simplicity when dealing with objects > as we can assume the nice old fashioned way of "no need to lock or consider > threads - it will not be an issue", and it buys us a good speedup vs what we > have now. It does mean another bit of a re-jigging of eo internals and we > need to add some API's to be able to do begin/end etc. and we can later add > object sending. > > I'm rather happy with this kind of direction. We never have to make eo objects > thread safe or create a special eo thread safe base class. Ever. You want to > talk from thread to thread, then we can have endpoint objects that get created > with one object on one end of the msg pipe (like a socketpair() or pipe()), > and a different object on the other end (create in one place then send one > end to another thread? The object internals hook them up via pipes or > threadqueues?). This makes currently "incorrect and dangerous code" fail > early instead of 1% of the time. It catches issues fast. It gets us speedups. > This potentially impacts promises too, but probably in a good way. This also > affects bindings - looking at JS/Lua specificially. If Lua had a threading > model right now it'd be to have 1 luastate per thread, but this means we > can't share objects... this means eo's model is the same and you would have > to detach an object from one thread (luastate) and make one appear on the > other end. I'm also happy that I think this solves a disagreement on how to > do threading. I know we have to somehow and not just stick our heads in the > sand. This solves it. It gives a clear optimal model that is relatively > robust and efficient. > > I can see benefits for memory allocation too. Specifically memory directly > needed for objects. We can add our own allocators for this instead of using > libc and these allocators DO NOT need to be threadsafe thus gaining speed > (well catch - if you send an object across we have to somehow deal with > this... unknown at the moment - an issue to consider)? > > The downsides are the extra work, the need to add API's to init eo per thread, > need to check for objects on thread shutdown (need the TLS free func to do > sanity checking for still-alive objects), the need to add all this foreign vs. > local table stuff for begin+end and then to hook that in, and the likely need > to add at least object sending, and the messaging infra. Also it will use a > bit more memory if you use eo across threads and actually create objects in > other threads. As well any mempools and other things that we currently share > across threads may need to become per-thread. > > So ... with that. Questions, comments, queries, devils-advocate. Find issues > with this. Wrap your head around it. This is very important as this impacts > many things subtly and some directly and clearly. > > -- > ------------- Codito, ergo sum - "I code, therefore I am" -------------- > The Rasterman (Carsten Haitzler) ras...@rasterman.com > -- ------------- Codito, ergo sum - "I code, therefore I am" -------------- The Rasterman (Carsten Haitzler) ras...@rasterman.com ------------------------------------------------------------------------------ _______________________________________________ enlightenment-devel mailing list enlightenment-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/enlightenment-devel