Re: [E-devel] RFC: EOID + Threads + TLS proposal

The Rasterman Mon, 05 Sep 2016 22:29:49 -0700

On Sun, 4 Sep 2016 00:15:15 +0900 Carsten Haitzler (The Rasterman)
<[email protected]> said:


THIS TIME it's right:

https://phab.enlightenment.org/D4281

it's a start. a good one.

> Why? This directly impacts design and limitations of EO. It also impacts
> performance. It also impacts safety. This is rather important to settle for
> 1.19 as this lays the base for the future of EFL.
> 
> So this past week I was trying to improve EOID lookup and found the EOID
> tables and caches were not locked so if you accessed any object or created or
> deleted objects across 2 or more threads you had race conditions waiting to
> happen. No locks! So I added a spinlock. In the tests there should be zero
> contention so the lock doesn't spin.
> 
> Result? Cost of EOID lookup (to find the real pointer of the object) went from
> 2% to 5% of CPU time. That's how much thread safety costs. It's not cheap.
> This got me thinking. "There has to be another way...".
> 
> So first. What is EOID? This is a 32 or 64bit VALUE we place in a pointer (yes
> I know, technically not portable to bizarre architectures - we had to do this to
> keep ABI in EFL). This ID gives us an INDIRECT lookup to an object pointer.
> Imagine the simplest version in your head is
> 
> Object *obj  = all_eo_objects[eoid];
> 
> So we take the value and check a table. The object pointer is then either a
> valid ptr (not NULL) and the object is safe to use, or it's NULL (because eo
> maintains the table and ensures any deleted object has its array member set
> to NULL). It is a little more complex than that. It's fragmented into several
> levels of smaller tables for memory efficiency. Also there's generation counts
> too. In fact check the attached "eoid.png" in this mail for a little bit of a
> better explanation (there are in fact 2 more fields that are used for ref and
> super call detection. ignore them as they are not relevant).
> 
> On 64bit we have a lot of bits to play with. On 32 it's limited. We use 2 bits
> for ref and super, so 30 left. Currently 8 bits for generation count, 5 bits
> for the top table, 5 for the middle table, and last table has 12 bits (so 32
> entries, then 32 more, then 4096 in the leaf table). Generation count goes up
> to 255 before looping to 0 as we always do generation++ every alloc. We might
> need to steal 2 more bits, so I was thinking take one from generation count
> and one from the last table entry value. A bit less safe and lower "max # of
> objects", so on 32bit the max of 4 million objects goes down to 2 million and
> the generation count range shrinks to 0 to 127. Safety of that part of the
> check is halved. :(
> 
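To make the 32-bit budget above concrete, here is a minimal sketch of packing
and unpacking those fields. The field widths (2 + 5 + 5 + 12 + 8 = 32) come
from the text; the bit ORDER and all the names (Eoid_Fields, eoid_pack, ...)
are assumptions made up for illustration, not the real EO layout.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths from the text (32-bit case). The bit ORDER is an assumption. */
enum {
    GEN_BITS  = 8,   /* generation count, wraps at 255 */
    LEAF_BITS = 12,  /* 4096 entries in the leaf table */
    MID_BITS  = 5,   /* 32 entries in the middle table */
    TOP_BITS  = 5,   /* 32 entries in the top table */
    FLAG_BITS = 2    /* ref + super markers */
};

typedef struct {
    uint32_t flags, top, mid, leaf, generation;
} Eoid_Fields;

static uint32_t bits_mask(unsigned bits)
{
    return (UINT32_C(1) << bits) - 1;
}

/* Pack the fields into one value laid out as [flags|top|mid|leaf|generation]. */
static uint32_t eoid_pack(Eoid_Fields f)
{
    assert(f.flags <= bits_mask(FLAG_BITS) && f.top <= bits_mask(TOP_BITS));
    assert(f.mid <= bits_mask(MID_BITS) && f.leaf <= bits_mask(LEAF_BITS));
    assert(f.generation <= bits_mask(GEN_BITS));
    uint32_t id = f.flags;
    id = (id << TOP_BITS) | f.top;
    id = (id << MID_BITS) | f.mid;
    id = (id << LEAF_BITS) | f.leaf;
    id = (id << GEN_BITS) | f.generation;
    return id;
}

/* Split a packed value back into its fields, lowest bits first. */
static Eoid_Fields eoid_unpack(uint32_t id)
{
    Eoid_Fields f;
    f.generation = id & bits_mask(GEN_BITS);  id >>= GEN_BITS;
    f.leaf       = id & bits_mask(LEAF_BITS); id >>= LEAF_BITS;
    f.mid        = id & bits_mask(MID_BITS);  id >>= MID_BITS;
    f.top        = id & bits_mask(TOP_BITS);  id >>= TOP_BITS;
    f.flags      = id & bits_mask(FLAG_BITS);
    return f;
}

/* Number of addressable slots = 2^(top + mid + leaf bits). */
static uint32_t eoid_max_objects(unsigned leaf_bits)
{
    return UINT32_C(1) << (TOP_BITS + MID_BITS + leaf_bits);
}
```

With these widths eoid_max_objects(12) is 4194304 and eoid_max_objects(11)
is 2097152, which is the "4 million goes down to 2 million" trade-off when one
bit is taken from the leaf index.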
> So every time you do any call in efl like:
> 
>   efl_gfx_show(obj);
> 
> We HAVE to resolve an EOID to a pointer. This was 2% of CPU time in the
> genlist autobounce test, but now it's 5% thanks to needing to be thread safe.
> This isn't great. We break up the EOID into its various fields, figure out
> what table to access and what row, check for non-NULL pointers, then compare
> generation count so it matches for added safety in case a slot is re-cycled
> (the EOID without generation count is the same between a deleted object and a
> newly created one and we accidentally use the old EOID). Every allocation of
> an eo object makes generation count go up by 1 and we just use the lowest N
> bits as the count, so if the slot is recycled the objects also have to be
> allocated in the same generation (unlikely). On 64bit the above numbers are
> pretty darned big, so there isn't an issue of space. Just on 32bit.
> 
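The lookup described above (walk the table levels, check for NULL, then
compare generation counts) could be sketched roughly like this. It is a
simplified stand-in, not the EO code: the ID is kept as an already-split
struct, locking and cleanup are left out, and every name here (Id_Table,
id_table_lookup, ...) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { TOP_N = 32, MID_N = 32, LEAF_N = 4096 };

typedef struct { void *ptr; uint8_t generation; } Entry;
typedef struct { Entry entries[LEAF_N]; } Leaf;
typedef struct { Leaf *leaves[MID_N]; } Mid;
typedef struct { Mid *mids[TOP_N]; uint8_t next_generation; } Id_Table;

/* An ID already split into its fields (bit packing is omitted here). */
typedef struct { unsigned top, mid, leaf; uint8_t generation; } Id;

/* Store ptr in a slot, allocating sub-tables on demand. Returns 0 on success. */
static int id_table_store(Id_Table *t, unsigned top, unsigned mid,
                          unsigned leaf, void *ptr, Id *out)
{
    assert(top < TOP_N && mid < MID_N && leaf < LEAF_N && ptr && out);
    if (!t->mids[top] && !(t->mids[top] = calloc(1, sizeof(Mid))))
        return -1;
    Mid *m = t->mids[top];
    if (!m->leaves[mid] && !(m->leaves[mid] = calloc(1, sizeof(Leaf))))
        return -1;
    Entry *e = &m->leaves[mid]->entries[leaf];
    e->ptr = ptr;
    e->generation = t->next_generation++; /* uint8_t wraps 255 -> 0 */
    out->top = top;
    out->mid = mid;
    out->leaf = leaf;
    out->generation = e->generation;
    return 0;
}

/* Find the slot an ID refers to, or NULL if a table level is missing. */
static Entry *id_table_entry(const Id_Table *t, Id id)
{
    if (id.top >= TOP_N || id.mid >= MID_N || id.leaf >= LEAF_N) return NULL;
    Mid *m = t->mids[id.top];
    if (!m) return NULL;
    Leaf *l = m->leaves[id.mid];
    if (!l) return NULL;
    return &l->entries[id.leaf];
}

/* Resolve an ID: the slot must be live AND its generation must match. */
static void *id_table_lookup(const Id_Table *t, Id id)
{
    Entry *e = id_table_entry(t, id);
    if (!e || !e->ptr || e->generation != id.generation) return NULL;
    return e->ptr;
}

/* Deleting an object NULLs its slot so later lookups fail safely. */
static void id_table_release(Id_Table *t, Id id)
{
    Entry *e = id_table_entry(t, id);
    if (e && e->generation == id.generation) e->ptr = NULL;
}
```

If a slot is released and then reused for a new object, the old ID still
points at the same slot but carries the old generation, so the lookup
returns NULL instead of handing back the wrong object.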
> So the real issue is this lock. Every time an object is looked up, created or
> destroyed we need to lock and unlock all of this EOID table space and related
> caches and generation count etc. It is just a single global lock but it's not
> cheap since we do this SO OFTEN (yes it does have a cache to store the last
> looked-up object, but this is global).
> 
> What can we do? Well remove the lock? Make it cheaper? Well we can't make the
> lock cheaper. I've looked. We MIGHT be able to make it smaller (not relevant
> as there is a single global one), but not cheaper in cycles. More fine grained
> locks won't help because we aren't talking contention here as the issue.
> Literally a SUCCESSFUL lock take and then release with no contention takes 3%
> of our CPU time.
> 
> So here is an idea. I checked. A TLS lookup for any var is 1/5th the cost or
> so of a lock+unlock. We could use __thread and this is FREE (no cost - well
> it's a mov only), but this is only free in binaries not shared libraries, so
> let's talk TLS which is much cheaper than a lock anyway. Why TLS?
> 
> 1. It will drop our cost above a lot.
> 2. We can remove several other locks in EO too cutting some more costs.
> 3. We can REALLY CHEAPLY enforce the rule of "you may not access an object
> outside its owning thread". Because every thread has its own EOID table, the
> EOID will be local to that thread only. Looking it up in another thread is
> looking up an "invalid" EOID. In fact we can make this pretty much always fail
> by using some domain bits in EOID like i mentioned above (steal some from
> table entries and/or generation count). Now literally stuff will FAIL and not
> magically work 99% of the time if you disobey these rules and access a
> rectangle object or a timer from another thread that doesn't own them as the
> objects literally are not in the local thread EOID table. The thread cannot see
> them (well is very unlikely to see them).
> 
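A rough sketch of the per-thread table idea, under some loud assumptions: it
uses C11 _Thread_local for brevity (the mail talks about __thread and TLS
lookups in a shared library, whose costs differ), it puts the 2 domain bits at
the top of a 32-bit ID, and it uses a tiny flat table. The names
(thread_ids_init, thread_ids_add, thread_ids_lookup) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SLOTS = 64, DOMAIN_SHIFT = 30 }; /* assumed: top 2 bits = domain */

typedef struct {
    uint32_t domain;     /* thread domain number, 0..3 */
    void *slots[SLOTS];
} Thread_Ids;

/* One private copy per thread, so no lock is taken on any access. */
static _Thread_local Thread_Ids tls_ids;

/* Hypothetical per-thread init; must run before any IDs are handed out. */
static void thread_ids_init(uint32_t domain)
{
    assert(domain < 4);
    tls_ids.domain = domain;
    for (size_t i = 0; i < SLOTS; i++) tls_ids.slots[i] = NULL;
}

/* Returns a new ID for obj in the calling thread's table, or 0 if full. */
static uint32_t thread_ids_add(void *obj)
{
    assert(obj != NULL);
    for (uint32_t i = 0; i < SLOTS; i++) {
        if (!tls_ids.slots[i]) {
            tls_ids.slots[i] = obj;
            /* index is stored +1 so that ID 0 is never valid */
            return (tls_ids.domain << DOMAIN_SHIFT) | (i + 1);
        }
    }
    return 0;
}

/* Resolve an ID against the CALLING thread's table only. */
static void *thread_ids_lookup(uint32_t id)
{
    uint32_t domain = id >> DOMAIN_SHIFT;
    uint32_t index = id & ((UINT32_C(1) << DOMAIN_SHIFT) - 1);
    if (domain != tls_ids.domain) return NULL; /* ID from another domain */
    if (index == 0 || index > SLOTS) return NULL;
    return tls_ids.slots[index - 1];
}
```

An ID minted in a thread of a different domain is rejected by the domain
check alone. Two threads sharing a domain would still each see only their own
slots, which is why the text says a foreign ID is "very unlikely" (not
impossible) to resolve.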
> So how to do this?
> 
> 0. Move the EOID table, and associated data into a TLS structure
> 1. Every thread HAS to do an eo_thread_init() (eo_init can do this implicitly
> for the mainloop thread, and we can make ecore_thread/later Efl.Thread do this
> per thread it creates for you so you don't have to).
> 2. This thread init sets up an initial generation count at a random value so
> generation counts can't easily be in sync and maybe swizzles the order EOID's
> are found/allocated and so on to minimize ID "sameness".
> 3. We take 2 more bits as a thread domain for the EOID. Mainloop thread has
> domain 0. All other threads have domain 1 UNLESS they explicitly ask for an
> "alternate" domain, which they must do before any eo objects are created. We
> have domains 2 and 3 for these "special" domains.
> 4. We make it a clear rule that threads cannot access objects outside of those
> that they created UNLESS:
> 4.1 An object is explicitly SENT from one thread to another (we can do this
> later but if this is done, the object must have a refcount of 1 only, no
> parent, no children, no objects referenced in keys, weak refs to/from this
> object etc.). We can release the EOID entry in thread 1, but not call
> destructor and free object memory, send the POINTER to thread 2, and here a
> new EOID local to that thread is allocated and that pointer adopted.
> 4.2 A whole thread is blocked and its table adopted as a secondary table. This
> is needed for ecore_thread_main_loop_begin/end(). This is safe because the
> OWNING thread is paused during this time. To do this the 2 threads must NOT
> SHARE the same domain. So a regular thread (domain 1) can always do a
> begin/end on the mainloop as they have differing domains. You have 2 more
> domains to use as object brokers that are less likely to have latency like
> the mainloop.
> 5. As long as we get the ptr of the TLS in the owning thread it will work when
> sent to another thread, so the above sharing of 2 thread tables temporarily
> can be done.
> 7. It's an EO (and EFL)-wide rule that you should not make threadsafe objects
> because EO just won't support it - you have to explicitly send objects around
> or do a begin/end of another thread to look at its objects (and then that is
> limited to a thread of a different domain - we have 4 so not bad).
> 8. We need to make sure eo doesn't allow you to do things like set a child of
> an object that is in a different domain to the parent, or with weak refs, key
> object refs etc.
> 9. When we do object sending we need maybe pipes with fd's with functions to
> process "incoming objects" and have some kind of eo object with callbacks to
> tell you when it receives a new incoming object so when sent from one thread,
> the thread releases its EOID but the whole data is floating around UNTIL it
> arrives at the other end and someone needs to know WHEN it arrives and what
> the new EOID is. So we need some kind of per-thread/loop "object receiver"
> and have to glue this into Efl.Loop if there is one in that thread.
> 
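Point 4.1 (sending an object) boils down to "drop the ID in one table without
destroying the object, then adopt the pointer under a new ID in the other
table". A minimal sketch of just that hand-over follows. It assumes flat
tables with the 2 domain bits at the top of the ID, passes both tables
explicitly instead of going through TLS and a pipe, and skips the refcount,
parent and weak-ref checks the list asks for. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SLOTS = 16, DOMAIN_SHIFT = 30 }; /* assumed: top 2 bits = domain */

typedef struct { uint32_t domain; void *slots[SLOTS]; } Id_Table;

/* Put obj in the first free slot; returns its ID, or 0 if the table is full. */
static uint32_t table_add(Id_Table *t, void *obj)
{
    assert(obj != NULL);
    for (uint32_t i = 0; i < SLOTS; i++) {
        if (!t->slots[i]) {
            t->slots[i] = obj;
            return (t->domain << DOMAIN_SHIFT) | (i + 1);
        }
    }
    return 0;
}

/* Map an ID to its slot in t, or NULL if the domain or index is wrong. */
static void **table_slot(Id_Table *t, uint32_t id)
{
    uint32_t index = id & ((UINT32_C(1) << DOMAIN_SHIFT) - 1);
    if ((id >> DOMAIN_SHIFT) != t->domain) return NULL;
    if (index == 0 || index > SLOTS) return NULL;
    return &t->slots[index - 1];
}

static void *table_lookup(Id_Table *t, uint32_t id)
{
    void **slot = table_slot(t, id);
    return slot ? *slot : NULL;
}

/* Move ownership: the object memory is untouched, only the ID changes.
 * Returns the new ID in `to`, or 0 (and leaves `from` alone) on failure. */
static uint32_t object_send(Id_Table *from, Id_Table *to, uint32_t id)
{
    void **slot = table_slot(from, id);
    if (!slot || !*slot) return 0;
    uint32_t new_id = table_add(to, *slot);
    if (new_id) *slot = NULL; /* old ID is now invalid in the sender */
    return new_id;
}
```

After a send the old ID fails in the sending table and the new ID only
resolves in the receiving one, so the receiver needs some way to be told the
new ID, which is the "object receiver" question raised in point 9.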
> I have most of the nasty bits sorted out ... all but:
> 
> When you are in a begin/end section and you see 2 EOID tables, when you CREATE
> a new object... which one does it go into? Remember that when you CALL a
> method on an obj it may go create objects internally too. How can you
> determine which to use? You should be able to access both without creating
> without issue with domains as above. You could delete fine since an object
> knows which table it belongs to in the current thread context based on domain
> number. They will be different. You can't bind a foreign domain in if it
> matches yours - it'll fail. But creation is special.
> 
> One option... if you create WITH a parent passed, the child must go into the
> same domain automatically. Operations mixing domains in an object tree should
> fail. What about other cases? Create a bare object with no parent... you add
> as a child later. How to choose which domain it goes into? Local or foreign?
> Maybe there is a context you can switch that is in your TLS that tells you
> which to use (local or foreign table). If we have a push/pop setup it'd be
> nice, but it's easy to get wrong. An explicit call to create with foreign and
> eo_add is local? So eo_foreign_add() uses the foreign domain (if adopted at
> the time, and if not it will either fail or just use local domain then). Worth
> thinking about.
> 
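One way to picture the "explicit call" option: inside a begin/end section the
thread sees its local table plus one adopted table, lookups pick the table
from the domain bits, and creation has to be told which table to use. The
sketch below assumes flat tables with the domain in the top 2 bits of the ID;
Thread_View, view_adopt, view_add_local and view_add_foreign are made-up names
(view_add_foreign standing in for the eo_foreign_add() idea), not existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SLOTS = 16, DOMAIN_SHIFT = 30 }; /* assumed: top 2 bits = domain */

typedef struct { uint32_t domain; void *slots[SLOTS]; } Id_Table;

/* What one thread can see: its own table plus, inside a begin/end
 * section, one adopted table from a paused thread. */
typedef struct {
    Id_Table *local;
    Id_Table *foreign; /* NULL outside begin/end */
} Thread_View;

/* Put obj in the first free slot; returns its ID, or 0 if the table is full. */
static uint32_t table_add(Id_Table *t, void *obj)
{
    assert(obj != NULL);
    for (uint32_t i = 0; i < SLOTS; i++) {
        if (!t->slots[i]) {
            t->slots[i] = obj;
            return (t->domain << DOMAIN_SHIFT) | (i + 1);
        }
    }
    return 0;
}

/* Adopt another thread's table; refused when the domains match. */
static int view_adopt(Thread_View *v, Id_Table *other)
{
    if (v->foreign || other->domain == v->local->domain) return -1;
    v->foreign = other;
    return 0;
}

static void view_release(Thread_View *v)
{
    v->foreign = NULL;
}

/* The domain bits in the ID pick the table, so lookup is unambiguous. */
static void *view_lookup(const Thread_View *v, uint32_t id)
{
    uint32_t domain = id >> DOMAIN_SHIFT;
    uint32_t index = id & ((UINT32_C(1) << DOMAIN_SHIFT) - 1);
    const Id_Table *t = NULL;
    if (domain == v->local->domain) t = v->local;
    else if (v->foreign && domain == v->foreign->domain) t = v->foreign;
    if (!t || index == 0 || index > SLOTS) return NULL;
    return t->slots[index - 1];
}

/* Creation has no ID yet, so the caller must say which table to use. */
static uint32_t view_add_local(Thread_View *v, void *obj)
{
    return table_add(v->local, obj);
}

/* Fails (returns 0) when no foreign table is adopted. */
static uint32_t view_add_foreign(Thread_View *v, void *obj)
{
    return v->foreign ? table_add(v->foreign, obj) : 0;
}
```

This takes the "fail if not adopted" variant. Falling back to the local
table instead, or a push/pop context stored in TLS, would change only the
two add functions.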
> This buys us REALLY NICE "thread safety" in the way that objects are just not
> allowed to span threads. They must be explicitly sent over and thus ownership
> (and EOID value) changes, or you must explicitly do a begin/end on another
> thread and adopt its ID table into your local space as a "foreign" table that
> allows you definitely "read only" access easily, even the ability to modify
> and delete, but just creation is tricky. This really will clear up lots of
> mistakes we have been seeing from code that uses threads and does "bad
> things" that happen to work 99% of the time then fail oddly 1% of the time.
> We don't have to write "is this my thread" checking code in every method
> because the design will do that mostly for us as a side-effect of TLS and
> normal EOID checking. This also buys us simplicity when dealing with objects
> as we can assume the nice old fashioned way of "no need to lock or consider
> threads - it will not be an issue", and it buys us a good speedup vs what we
> have now. It does mean another bit of a re-jigging of eo internals and we
> need to add some API's to be able to do begin/end etc. and we can later add
> object sending.
> 
> I'm rather happy with this kind of direction. We never have to make eo objects
> thread safe or create a special eo thread safe base class. Ever. You want to
> talk from thread to thread, then we can have endpoint objects that get created
> with one object on one end of the msg pipe (like a socketpair() or pipe()),
> and a different object on the other end (create in one place then send one
> end to another thread? The object internals hook them up via pipes or
> threadqueues?). This makes currently "incorrect and dangerous code" fail
> early instead of 1% of the time. It catches issues fast. It gets us speedups.
> This potentially impacts promises too, but probably in a good way. This also
> affects bindings - looking at JS/Lua specifically. If Lua had a threading
> model right now it'd be to have 1 luastate per thread, but this means we
> can't share objects... this means eo's model is the same and you would have
> to detach an object from one thread (luastate) and make one appear on the
> other end. I'm also happy that I think this solves a disagreement on how to
> do threading. I know we have to somehow and not just stick our heads in the
> sand. This solves it. It gives a clear optimal model that is relatively
> robust and efficient.
> 
> I can see benefits for memory allocation too. Specifically memory directly
> needed for objects. We can add our own allocators for this instead of using
> libc and these allocators DO NOT need to be threadsafe thus gaining speed
> (well catch - if you send an object across we have to somehow deal with
> this... unknown at the moment - an issue to consider)?
> 
> The downsides are the extra work, the need to add API's to init eo per thread,
> need to check for objects on thread shutdown (need the TLS free func to do
> sanity checking for still-alive objects), the need to add all this foreign vs.
> local table stuff for begin+end and then to hook that in, and the likely need
> to add at least object sending, and the messaging infra. Also it will use a
> bit more memory if you use eo across threads and actually create objects in
> other threads. As well any mempools and other things that we currently share
> across threads may need to become per-thread.
> 
> So ... with that. Questions, comments, queries, devils-advocate. Find issues
> with this. Wrap your head around it. This is very important as this impacts
> many things subtly and some directly and clearly.
> 
> -- 
> ------------- Codito, ergo sum - "I code, therefore I am" --------------
> The Rasterman (Carsten Haitzler)    [email protected]
> 


-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    [email protected]

