[E-devel] RFC: EOID + Threads + TLS proposal

The Rasterman Sat, 03 Sep 2016 08:17:23 -0700

Why? This directly impacts design and limitations of EO. It also impacts
performance. It also impacts safety. This is rather important to settle for
1.19 as this lays the base for the future of EFL.


So this past week I was trying to improve EOID lookup and found the EOID tables
and caches were not locked so if you accessed any object or created or deleted
objects across 2 or more threads you had race conditions waiting to happen. No
locks! So I added a spinlock. In the tests there should be zero contention so
the lock doesn't spin.

Result? Cost of EOID lookup (to find the real pointer of the object) went from
2% to 5% of CPU time. That's how much thread safety costs. It's not cheap. This
got me thinking. "There has to be another way...".

So first. What is EOID? This is a 32 or 64bit VALUE we place in a pointer (yes
I know technically not portable bizarre architectures - We had to do this to
keep ABI in EFL). This ID gives us an INDIRECT lookup to an object pointer.
Imagine the simplest version in your head is

Object *obj  = all_eo_objects[eoid];

So we take the value and check a table. The object pointer is then either a
valid ptr (not NULL) and the object is safe to use, or it's NULL (because eo
maintains the table and ensures any deleted object has it's array member set to
NULL). It is a little more complex than that. It's fragmented into several
levels of smaller tables for memory efficiency. Also there's generation counts
too. In fact check the attached "eoid.png" in this mail for a little bit of a
better explanation (there are in fact 2 more fields that are used for ref and
super call detection. ignore them as they are not relevant).

On 64bit we have a lot of bits to play with. On 32 it's limited. We use 2 bits
for ref and super, so 30 left. Currently 8 bits for generation count, 5 bits
for the top table, 5 for the middle table, and last table has 12 bits (so 32
entries, 32 more than 4096 in the leaf table). Generation count goes up to 255
before looping to 0 as we always do generation++ every alloc. We might need to
steal 2 more bits, so i was thinking take one from generation count and one
from the last table entry value. a bit less safe and lower "max # of objects"
so on 32bit max 3 objects goes down to 2 million and Generation count to 0 or
127. Safety of that part of the check is halved. :(

So every time you do any call in efl like:

  efl_gfx_show(obj);

We HAVE to resolve an EOID to a pointer. This was 2% of CPU time in the genlist
autobounce test, but now it's 5% thanks to needing to be thread safe. This
isn't great. We break up the EOID into its various fields, figure out what
table to access and what row, check for non-NULL pointers, then compare
generation count so it matches for added safety in case a slot is re-cycled
(the EOID without generation count is the same between a deleted object and a
newly created one and we accidentally use the old EOID). Every allocation of an
eo object makes generation count go up by 1 and we just use the lowest N bits
as the count, so if the slot is recycled the objects also have to be allocated
in the same generation (unlikely). On 64bit the above numbers are pretty
darned big, so there isn't an issue of space. Just on 32bit.

So the real issue is this lock. Every time and object is looked up, created or
destroyed we need to lock and unlock all of this EOID table space and related
caches and generation count etc. It is just a single global lock but it's not
cheap since we do this SO OFTEN (yes it does have a cache to store the last
looked-up object, but this is global).

What can we do? Well remove the lock? Make it cheaper? Well we can't make the
lock cheaper. I've looked. We MIGHT be able to make it smaller (not relevant as
there is a single global one), but not cheaper in cycles. More fine grained
locks won't help because we aren't talking contention here as the issue.
Literally a SUCCESSFUL lock take and then release with no contention takes 3%
of our CPU time.

So here is an idea. I checked. A TLS lookup for any var is 1/5th the cost or so
of a lock+unlock. We could use __thread and this is FREE (no cost - well it's a
mov only), but this is only free in binaries not shared libraries, so let's talk
TLS which is much cheaper than a lock anyway. Why TLS?

1. It will drop out cost above a lot.
2. We can remove several other locks in EO too cutting some more costs.
3. We can REALLY CHEAPLY enforce the rule of "you may not access an object
outside its owning thread". Because every thread has it's own EOID table, the
EOID will be local to that thread only. Looking it up in another thread is
looking up an "invalid" EOID. In fact we can make this pretty much always fail
by using some domain bits in EOID like i mentioned above (steal some from table
entries and/or generation count). Now literally stuff will FAIL and not
magically work 99% of the time if you disobey these rules and access a
rectangle object or a timer from another thread that doesn't own them as the
objects literally are not in the local thread EOID table. The tread cannot see
them (well is very unlikely to see them).

So how to do this?

0. Move the EOID table, and associated data into a TLS structure
1. Every thread HAS to do an eo_thread_init() (eo_init can do this implicitly
for the mainloop thread, and we can make ecore_thread/later Efl.Thread do this
per thread it creates for you so you don't have to).
2. This thread init sets up an initial generation count at a random value so
generation counts can't easily be in sync and maybe swizzles the order EOID's
are found/allocated and so on to minimize ID "sameness".
3. We take 2 more bits as a thread domain for the EOID. Mainloop thread has
domain 0. All other threads have domain 1 UNLESS they explicitly ask for a
"alternate" domain and must do so before any eo objects are created and we have
domain 3 and 4 for this for "special" domains.
4. We make it a clear rule that threads cannot access objects outside of those
that they created UNLESS:
4.1 An object is explicitly SENT from one thread to another (we can do this
later but if this is done, the object must have a refcount of 1 only, no
parent, no children, no objects referenced in keys, weak refts to/from this
object etc.). We can release the EOID entry in thread 1, but not call
destructor and free object memory, send the POINTER to thread 2, and here a new
EOID local to that thread is allocated and that pointer adopted.
4.2 A whole thread is blocked and its table adopted as a secondary table. This
is needed for ecore_main_loop_begin/end(). This is safe because the OWNING
thread is paused during this time. To do this the 2 threads must NOT SHARE the
same domain. So a regular thread (domain 1) can always do a begin/end on the
mainloop as they have differing domains. You have 2 more domains to use as
object brokers that are less likely to have latency like the mainloop.
5. As long as we get the ptr of the TLS in the owning thread it will work when
sent to another thread, so the above sharing of 2 thread tables temporarily can
be done.
7. It's an EO (and EFL)-wide rule that you should not make threadsafe objects
because EO just won't support it - you have to explicitly send objects around
or do a begin/end of another thread to look at it's objects (and then that is
limited to a thread of a different domain - we have 4 so not bad).
8. We need to make sure eo doesn't allow you to do things like set a child of
an object that is in a different domain to the parent, or with weak refs, key
object refs etc.
9. When we do object sending we need maybe pipes with fd's with functions to
process "incoming objects" and have some kind of eo object with callbacks to
tell you when it receives a new incoming object so when sent from one thread,
the thread releases it's EOID but the whole data is floating around UNTIL it
arrives at the other end and someone needs to know WHEN it arrives and what the
new EOID is. So we need some kind of per-thread/loop "object receiver" and have
to glue this into Efl.Loop if there is one in that thread.

I have most of the nasty bits sorted out ... all but:

When you are in a begin/end section and you see 2 EOID tables, when you CREATE
a new object... which one does it go into? Remember that when you CALL a method
on an obj it may go create objects internally too. How can you determine which
to use? You should be able to access both without creating without issue with
domains as above. You could delete fine since an object knows which table it
belongs to in the current thread context based on domain number. They will be
different. You can't bind a foreign domain in if it matches yours - it'll fail.
But creation is special.

One option... if you create WITH a parent passed, the child must go into the
same domain automatically. Operations mixing domains in an object tree should
fail. What about other cases? Create a bare object with no parent... you add
as a child later. How to choose which domain it goes into? Local or fireign?
Maybe there is a context you can switch that is in your TLS that tells you
which to use (local or foreign table). If we have a push/pop setup it'd be
nice, but it's easy to get wrong. An explicit call to crate with foreign and
eo_add is local? So eo_foregin_add() uses the foreign domain (if adopted at the
time, and if not it will either fail or just use local domain then). Worth
thinking about.

This buys us REALLY NICE "thread safety" in the way that objects are just not
allowed to span threads. They must be explicitly sent over and thus ownership
(and EOID value) changes, or you must explicitly do a begin/end on another
thread and adopt it's ID table into your local space as a "foregin" table that
allows you definitely "read only" access easily, even the ability to modify and
delete, but just creation is tricky. This really will clear up lots of mistakes
we have been seeing from code that uses threads and does "bad things" that
happen to work 99% of the time then fail oddly 1% of the time. We don't have to
write "is this my thread" checking code in every method because the design will
do that mostly for us as a side-effect of TLS and normal EOID checking. This
also buys us simplicity when dealing with objects as we can assume the nice old
fashioned way of "no need to lock or consider threads - it will not be an
issue", and it buys us a good speedup vs what we have now. It does mean another
bit of a re-jigging of eo internals and we need to add some API's to be able to
do begin/end etc. and we can later add object sending.

I'm rather happy with this kind of direction. We never have to make eo objects
thread safe or create a special eo thread safe base class. Ever. You want to
talk from thread to thread, then we can have endpoint objects that get created
with one object on one end of the msg pipe (like a socketpair() or pipe()), and
a different object on the other end (create in one place then send one end to
another thread? The object internals hook them up via pipes or threadqueues?).
This makes currently "incorrect and dangerous code" fail early instead of 1% of
the time. It catches issues fast. It gets us speedups. This potentially impacts
promises too, but probably in a good way. This also affects bindings - looking
at JS/Lua specificially. If Lua had a threading model right now it'd be to have
1 luastate per thread, but this means we can't share objects... this means eo's
model is the same and you would have to detach an object from one thread
(luastate) and make one appear on the other end. I'm also happy that I think
this solves a disagreement on how to do threading. I know we have to somehow
and not just stick our heads in the sand. This solves it. It gives a clear
optimal model that is relatively robust and efficient.

I can see benefits for memory allocation too. Specifically memory directly
needed for objects. We can add our own allocators for this instead of using
libc and these allocators DO NOT need to be threadsafe thus gaining speed (well
catch - if you send an object across we have to somehow deal with this...
unknown at the moment - an issue to consider)?

The downsides are the extra work, the need to add API's to init eo per thread,
need to check for objects on thread shutdown (need the TLS free func to do
sanity checking for still-alive objects), the need to add all this foreign vs.
local table stuff for begin+end and then to hook that in, and the likely need
to add at least object sending, and the messaging infra. Also it will use a bit
more memory if you use eo across threads and actually create objects in other
threads. As well any mempools and other things that we currently share across
threads may need to become per-thread.

So ... with that. Questions, comments, queries, devils-advocate. Find issues
with this. Wrap your head around it. This is very important as this impacts
many things subtly and some directly and clearly.

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    ras...@rasterman.com

------------------------------------------------------------------------------

_______________________________________________
enlightenment-devel mailing list
enlightenment-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/enlightenment-devel

[E-devel] RFC: EOID + Threads + TLS proposal

Reply via email to