Re: [E-devel] RFC: EOID + Threads + TLS proposal

Tom Hacohen Tue, 06 Sep 2016 07:26:10 -0700

On 03/09/16 16:15, Carsten Haitzler (The Rasterman) wrote:
> Why? This directly impacts design and limitations of EO. It also impacts
> performance. It also impacts safety. This is rather important to settle for
> 1.19 as this lays the base for the future of EFL.
>
> So this past week I was trying to improve EOID lookup and found the EOID 
> tables
> and caches were not locked so if you accessed any object or created or deleted
> objects across 2 or more threads you had race conditions waiting to happen. No
> locks! So I added a spinlock. In the tests there should be zero contention so
> the lock doesn't spin.
>
> Result? Cost of EOID lookup (to find the real pointer of the object) went from
> 2% to 5% of CPU time. That's how much thread safety costs. It's not cheap. 
> This
> got me thinking. "There has to be another way...".
>
> So first. What is EOID? This is a 32 or 64bit VALUE we place in a pointer (yes
> I know technically not portable bizarre architectures - We had to do this to
> keep ABI in EFL). This ID gives us an INDIRECT lookup to an object pointer.
> Imagine the simplest version in your head is
>
> Object *obj  = all_eo_objects[eoid];
>
> So we take the value and check a table. The object pointer is then either a
> valid ptr (not NULL) and the object is safe to use, or it's NULL (because eo
> maintains the table and ensures any deleted object has it's array member set 
> to
> NULL). It is a little more complex than that. It's fragmented into several
> levels of smaller tables for memory efficiency. Also there's generation counts
> too. In fact check the attached "eoid.png" in this mail for a little bit of a
> better explanation (there are in fact 2 more fields that are used for ref and
> super call detection. ignore them as they are not relevant).
>
> On 64bit we have a lot of bits to play with. On 32 it's limited. We use 2 bits
> for ref and super, so 30 left. Currently 8 bits for generation count, 5 bits
> for the top table, 5 for the middle table, and last table has 12 bits (so 32
> entries, 32 more than 4096 in the leaf table). Generation count goes up to 255
> before looping to 0 as we always do generation++ every alloc. We might need to
> steal 2 more bits, so i was thinking take one from generation count and one
> from the last table entry value. a bit less safe and lower "max # of objects"
> so on 32bit max 3 objects goes down to 2 million and Generation count to 0 or
> 127. Safety of that part of the check is halved. :(
>
> So every time you do any call in efl like:
>
>   efl_gfx_show(obj);
>
> We HAVE to resolve an EOID to a pointer. This was 2% of CPU time in the 
> genlist
> autobounce test, but now it's 5% thanks to needing to be thread safe. This
> isn't great. We break up the EOID into its various fields, figure out what
> table to access and what row, check for non-NULL pointers, then compare
> generation count so it matches for added safety in case a slot is re-cycled
> (the EOID without generation count is the same between a deleted object and a
> newly created one and we accidentally use the old EOID). Every allocation of 
> an
> eo object makes generation count go up by 1 and we just use the lowest N bits
> as the count, so if the slot is recycled the objects also have to be allocated
> in the same generation (unlikely). On 64bit the above numbers are pretty
> darned big, so there isn't an issue of space. Just on 32bit.
>
> So the real issue is this lock. Every time and object is looked up, created or
> destroyed we need to lock and unlock all of this EOID table space and related
> caches and generation count etc. It is just a single global lock but it's not
> cheap since we do this SO OFTEN (yes it does have a cache to store the last
> looked-up object, but this is global).
>
> What can we do? Well remove the lock? Make it cheaper? Well we can't make the
> lock cheaper. I've looked. We MIGHT be able to make it smaller (not relevant 
> as
> there is a single global one), but not cheaper in cycles. More fine grained
> locks won't help because we aren't talking contention here as the issue.
> Literally a SUCCESSFUL lock take and then release with no contention takes 3%
> of our CPU time.
>
> So here is an idea. I checked. A TLS lookup for any var is 1/5th the cost or 
> so
> of a lock+unlock. We could use __thread and this is FREE (no cost - well it's 
> a
> mov only), but this is only free in binaries not shared libraries, so let's 
> talk
> TLS which is much cheaper than a lock anyway. Why TLS?
>
> 1. It will drop out cost above a lot.
> 2. We can remove several other locks in EO too cutting some more costs.
> 3. We can REALLY CHEAPLY enforce the rule of "you may not access an object
> outside its owning thread". Because every thread has it's own EOID table, the
> EOID will be local to that thread only. Looking it up in another thread is
> looking up an "invalid" EOID. In fact we can make this pretty much always fail
> by using some domain bits in EOID like i mentioned above (steal some from 
> table
> entries and/or generation count). Now literally stuff will FAIL and not
> magically work 99% of the time if you disobey these rules and access a
> rectangle object or a timer from another thread that doesn't own them as the
> objects literally are not in the local thread EOID table. The tread cannot see
> them (well is very unlikely to see them).
>
> So how to do this?
>
> 0. Move the EOID table, and associated data into a TLS structure
> 1. Every thread HAS to do an eo_thread_init() (eo_init can do this implicitly
> for the mainloop thread, and we can make ecore_thread/later Efl.Thread do this
> per thread it creates for you so you don't have to).
> 2. This thread init sets up an initial generation count at a random value so
> generation counts can't easily be in sync and maybe swizzles the order EOID's
> are found/allocated and so on to minimize ID "sameness".
> 3. We take 2 more bits as a thread domain for the EOID. Mainloop thread has
> domain 0. All other threads have domain 1 UNLESS they explicitly ask for a
> "alternate" domain and must do so before any eo objects are created and we 
> have
> domain 3 and 4 for this for "special" domains.
> 4. We make it a clear rule that threads cannot access objects outside of those
> that they created UNLESS:
> 4.1 An object is explicitly SENT from one thread to another (we can do this
> later but if this is done, the object must have a refcount of 1 only, no
> parent, no children, no objects referenced in keys, weak refts to/from this
> object etc.). We can release the EOID entry in thread 1, but not call
> destructor and free object memory, send the POINTER to thread 2, and here a 
> new
> EOID local to that thread is allocated and that pointer adopted.
> 4.2 A whole thread is blocked and its table adopted as a secondary table. This
> is needed for ecore_main_loop_begin/end(). This is safe because the OWNING
> thread is paused during this time. To do this the 2 threads must NOT SHARE the
> same domain. So a regular thread (domain 1) can always do a begin/end on the
> mainloop as they have differing domains. You have 2 more domains to use as
> object brokers that are less likely to have latency like the mainloop.
> 5. As long as we get the ptr of the TLS in the owning thread it will work when
> sent to another thread, so the above sharing of 2 thread tables temporarily 
> can
> be done.
> 7. It's an EO (and EFL)-wide rule that you should not make threadsafe objects
> because EO just won't support it - you have to explicitly send objects around
> or do a begin/end of another thread to look at it's objects (and then that is
> limited to a thread of a different domain - we have 4 so not bad).
> 8. We need to make sure eo doesn't allow you to do things like set a child of
> an object that is in a different domain to the parent, or with weak refs, key
> object refs etc.
> 9. When we do object sending we need maybe pipes with fd's with functions to
> process "incoming objects" and have some kind of eo object with callbacks to
> tell you when it receives a new incoming object so when sent from one thread,
> the thread releases it's EOID but the whole data is floating around UNTIL it
> arrives at the other end and someone needs to know WHEN it arrives and what 
> the
> new EOID is. So we need some kind of per-thread/loop "object receiver" and 
> have
> to glue this into Efl.Loop if there is one in that thread.
>
> I have most of the nasty bits sorted out ... all but:
>
> When you are in a begin/end section and you see 2 EOID tables, when you CREATE
> a new object... which one does it go into? Remember that when you CALL a 
> method
> on an obj it may go create objects internally too. How can you determine which
> to use? You should be able to access both without creating without issue with
> domains as above. You could delete fine since an object knows which table it
> belongs to in the current thread context based on domain number. They will be
> different. You can't bind a foreign domain in if it matches yours - it'll 
> fail.
> But creation is special.
>
> One option... if you create WITH a parent passed, the child must go into the
> same domain automatically. Operations mixing domains in an object tree should
> fail. What about other cases? Create a bare object with no parent... you add
> as a child later. How to choose which domain it goes into? Local or fireign?
> Maybe there is a context you can switch that is in your TLS that tells you
> which to use (local or foreign table). If we have a push/pop setup it'd be
> nice, but it's easy to get wrong. An explicit call to crate with foreign and
> eo_add is local? So eo_foregin_add() uses the foreign domain (if adopted at 
> the
> time, and if not it will either fail or just use local domain then). Worth
> thinking about.
>
> This buys us REALLY NICE "thread safety" in the way that objects are just not
> allowed to span threads. They must be explicitly sent over and thus ownership
> (and EOID value) changes, or you must explicitly do a begin/end on another
> thread and adopt it's ID table into your local space as a "foregin" table that
> allows you definitely "read only" access easily, even the ability to modify 
> and
> delete, but just creation is tricky. This really will clear up lots of 
> mistakes
> we have been seeing from code that uses threads and does "bad things" that
> happen to work 99% of the time then fail oddly 1% of the time. We don't have 
> to
> write "is this my thread" checking code in every method because the design 
> will
> do that mostly for us as a side-effect of TLS and normal EOID checking. This
> also buys us simplicity when dealing with objects as we can assume the nice 
> old
> fashioned way of "no need to lock or consider threads - it will not be an
> issue", and it buys us a good speedup vs what we have now. It does mean 
> another
> bit of a re-jigging of eo internals and we need to add some API's to be able 
> to
> do begin/end etc. and we can later add object sending.
>
> I'm rather happy with this kind of direction. We never have to make eo objects
> thread safe or create a special eo thread safe base class. Ever. You want to
> talk from thread to thread, then we can have endpoint objects that get created
> with one object on one end of the msg pipe (like a socketpair() or pipe()), 
> and
> a different object on the other end (create in one place then send one end to
> another thread? The object internals hook them up via pipes or threadqueues?).
> This makes currently "incorrect and dangerous code" fail early instead of 1% 
> of
> the time. It catches issues fast. It gets us speedups. This potentially 
> impacts
> promises too, but probably in a good way. This also affects bindings - looking
> at JS/Lua specificially. If Lua had a threading model right now it'd be to 
> have
> 1 luastate per thread, but this means we can't share objects... this means 
> eo's
> model is the same and you would have to detach an object from one thread
> (luastate) and make one appear on the other end. I'm also happy that I think
> this solves a disagreement on how to do threading. I know we have to somehow
> and not just stick our heads in the sand. This solves it. It gives a clear
> optimal model that is relatively robust and efficient.
>
> I can see benefits for memory allocation too. Specifically memory directly
> needed for objects. We can add our own allocators for this instead of using
> libc and these allocators DO NOT need to be threadsafe thus gaining speed 
> (well
> catch - if you send an object across we have to somehow deal with this...
> unknown at the moment - an issue to consider)?
>
> The downsides are the extra work, the need to add API's to init eo per thread,
> need to check for objects on thread shutdown (need the TLS free func to do
> sanity checking for still-alive objects), the need to add all this foreign vs.
> local table stuff for begin+end and then to hook that in, and the likely need
> to add at least object sending, and the messaging infra. Also it will use a 
> bit
> more memory if you use eo across threads and actually create objects in other
> threads. As well any mempools and other things that we currently share across
> threads may need to become per-thread.
>
> So ... with that. Questions, comments, queries, devils-advocate. Find issues
> with this. Wrap your head around it. This is very important as this impacts
> many things subtly and some directly and clearly.


Assuming it's what we talked about on IRC, I'm happy with it. It's 
essentially an implementation what I wanted from the beginning (Eo being 
single threaded and if you want multi-thread, do something like 
socketpair()) with the added coolness of enforcing IDs are unique per 
thread. We'll probably need to change the error messages about ID not 
found to maybe check the other threads if the ID is valid there so we 
can say something like: "You tried accessing object X from thread Y, but 
it's actually an object of thread Z. DENIED! Read more about foobar().".

My only concern, as always, is performance. I know you said locks are 
slower than TLS, but while we only needed to lock on write, now we use 
TLS on both read and write, so it is a concern.

Hope it pans out well. I'm looking forward to seeing benchmarks (and of 
course, tests with threads and etc). :)

--
Tom.

------------------------------------------------------------------------------
_______________________________________________
enlightenment-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/enlightenment-devel

Re: [E-devel] RFC: EOID + Threads + TLS proposal

Reply via email to