On Tue, 6 Sep 2016 15:24:44 +0100 Tom Hacohen <t...@osg.samsung.com> said:

> On 03/09/16 16:15, Carsten Haitzler (The Rasterman) wrote:
> > Why? This directly impacts design and limitations of EO. It also impacts
> > performance. It also impacts safety. This is rather important to settle for
> > 1.19 as this lays the base for the future of EFL.
> >
> > So this past week I was trying to improve EOID lookup and found the EOID
> > tables and caches were not locked so if you accessed any object or created
> > or deleted objects across 2 or more threads you had race conditions waiting
> > to happen. No locks! So I added a spinlock. In the tests there should be
> > zero contention so the lock doesn't spin.
> >
> > Result? Cost of EOID lookup (to find the real pointer of the object) went
> > from 2% to 5% of CPU time. That's how much thread safety costs. It's not
> > cheap. This got me thinking. "There has to be another way...".
> >
> > So first. What is EOID? This is a 32 or 64bit VALUE we place in a pointer
> > (yes I know technically not portable bizarre architectures - We had to do
> > this to keep ABI in EFL). This ID gives us an INDIRECT lookup to an object
> > pointer. Imagine the simplest version in your head is
> >
> > Object *obj  = all_eo_objects[eoid];
> >
> > So we take the value and check a table. The object pointer is then either a
> > valid ptr (not NULL) and the object is safe to use, or it's NULL (because eo
> > maintains the table and ensures any deleted object has it's array member
> > set to NULL). It is a little more complex than that. It's fragmented into
> > several levels of smaller tables for memory efficiency. Also there's
> > generation counts too. In fact check the attached "eoid.png" in this mail
> > for a little bit of a better explanation (there are in fact 2 more fields
> > that are used for ref and super call detection. ignore them as they are not
> > relevant).
> >
> > On 64bit we have a lot of bits to play with. On 32 it's limited. We use 2
> > bits for ref and super, so 30 left. Currently 8 bits for generation count,
> > 5 bits for the top table, 5 for the middle table, and last table has 12
> > bits (so 32 entries, 32 more than 4096 in the leaf table). Generation count
> > goes up to 255 before looping to 0 as we always do generation++ every
> > alloc. We might need to steal 2 more bits, so i was thinking take one from
> > generation count and one from the last table entry value. a bit less safe
> > and lower "max # of objects" so on 32bit max 3 objects goes down to 2
> > million and Generation count to 0 or
> > 127. Safety of that part of the check is halved. :(
> >
> > So every time you do any call in efl like:
> >
> >   efl_gfx_show(obj);
> >
> > We HAVE to resolve an EOID to a pointer. This was 2% of CPU time in the
> > genlist autobounce test, but now it's 5% thanks to needing to be thread
> > safe. This isn't great. We break up the EOID into its various fields,
> > figure out what table to access and what row, check for non-NULL pointers,
> > then compare generation count so it matches for added safety in case a slot
> > is re-cycled (the EOID without generation count is the same between a
> > deleted object and a newly created one and we accidentally use the old
> > EOID). Every allocation of an eo object makes generation count go up by 1
> > and we just use the lowest N bits as the count, so if the slot is recycled
> > the objects also have to be allocated in the same generation (unlikely). On
> > 64bit the above numbers are pretty darned big, so there isn't an issue of
> > space. Just on 32bit.
> >
> > So the real issue is this lock. Every time and object is looked up, created
> > or destroyed we need to lock and unlock all of this EOID table space and
> > related caches and generation count etc. It is just a single global lock
> > but it's not cheap since we do this SO OFTEN (yes it does have a cache to
> > store the last looked-up object, but this is global).
> >
> > What can we do? Well remove the lock? Make it cheaper? Well we can't make
> > the lock cheaper. I've looked. We MIGHT be able to make it smaller (not
> > relevant as there is a single global one), but not cheaper in cycles. More
> > fine grained locks won't help because we aren't talking contention here as
> > the issue. Literally a SUCCESSFUL lock take and then release with no
> > contention takes 3% of our CPU time.
> >
> > So here is an idea. I checked. A TLS lookup for any var is 1/5th the cost
> > or so of a lock+unlock. We could use __thread and this is FREE (no cost -
> > well it's a mov only), but this is only free in binaries not shared
> > libraries, so let's talk TLS which is much cheaper than a lock anyway. Why
> > TLS?
> >
> > 1. It will drop out cost above a lot.
> > 2. We can remove several other locks in EO too cutting some more costs.
> > 3. We can REALLY CHEAPLY enforce the rule of "you may not access an object
> > outside its owning thread". Because every thread has it's own EOID table,
> > the EOID will be local to that thread only. Looking it up in another thread
> > is looking up an "invalid" EOID. In fact we can make this pretty much
> > always fail by using some domain bits in EOID like i mentioned above (steal
> > some from table entries and/or generation count). Now literally stuff will
> > FAIL and not magically work 99% of the time if you disobey these rules and
> > access a rectangle object or a timer from another thread that doesn't own
> > them as the objects literally are not in the local thread EOID table. The
> > tread cannot see them (well is very unlikely to see them).
> >
> > So how to do this?
> >
> > 0. Move the EOID table, and associated data into a TLS structure
> > 1. Every thread HAS to do an eo_thread_init() (eo_init can do this
> > implicitly for the mainloop thread, and we can make ecore_thread/later
> > Efl.Thread do this per thread it creates for you so you don't have to).
> > 2. This thread init sets up an initial generation count at a random value so
> > generation counts can't easily be in sync and maybe swizzles the order
> > EOID's are found/allocated and so on to minimize ID "sameness".
> > 3. We take 2 more bits as a thread domain for the EOID. Mainloop thread has
> > domain 0. All other threads have domain 1 UNLESS they explicitly ask for a
> > "alternate" domain and must do so before any eo objects are created and we
> > have domain 3 and 4 for this for "special" domains.
> > 4. We make it a clear rule that threads cannot access objects outside of
> > those that they created UNLESS:
> > 4.1 An object is explicitly SENT from one thread to another (we can do this
> > later but if this is done, the object must have a refcount of 1 only, no
> > parent, no children, no objects referenced in keys, weak refts to/from this
> > object etc.). We can release the EOID entry in thread 1, but not call
> > destructor and free object memory, send the POINTER to thread 2, and here a
> > new EOID local to that thread is allocated and that pointer adopted.
> > 4.2 A whole thread is blocked and its table adopted as a secondary table.
> > This is needed for ecore_main_loop_begin/end(). This is safe because the
> > OWNING thread is paused during this time. To do this the 2 threads must NOT
> > SHARE the same domain. So a regular thread (domain 1) can always do a
> > begin/end on the mainloop as they have differing domains. You have 2 more
> > domains to use as object brokers that are less likely to have latency like
> > the mainloop.
> > 5. As long as we get the ptr of the TLS in the owning thread it will work
> > when sent to another thread, so the above sharing of 2 thread tables
> > temporarily can be done.
> > 7. It's an EO (and EFL)-wide rule that you should not make threadsafe
> > objects because EO just won't support it - you have to explicitly send
> > objects around or do a begin/end of another thread to look at it's objects
> > (and then that is limited to a thread of a different domain - we have 4 so
> > not bad).
> > 8. We need to make sure eo doesn't allow you to do things like set a child
> > of an object that is in a different domain to the parent, or with weak
> > refs, key object refs etc.
> > 9. When we do object sending we need maybe pipes with fd's with functions to
> > process "incoming objects" and have some kind of eo object with callbacks to
> > tell you when it receives a new incoming object so when sent from one
> > thread, the thread releases it's EOID but the whole data is floating around
> > UNTIL it arrives at the other end and someone needs to know WHEN it arrives
> > and what the new EOID is. So we need some kind of per-thread/loop "object
> > receiver" and have to glue this into Efl.Loop if there is one in that
> > thread.
> >
> > I have most of the nasty bits sorted out ... all but:
> >
> > When you are in a begin/end section and you see 2 EOID tables, when you
> > CREATE a new object... which one does it go into? Remember that when you
> > CALL a method on an obj it may go create objects internally too. How can
> > you determine which to use? You should be able to access both without
> > creating without issue with domains as above. You could delete fine since
> > an object knows which table it belongs to in the current thread context
> > based on domain number. They will be different. You can't bind a foreign
> > domain in if it matches yours - it'll fail. But creation is special.
> >
> > One option... if you create WITH a parent passed, the child must go into the
> > same domain automatically. Operations mixing domains in an object tree
> > should fail. What about other cases? Create a bare object with no parent...
> > you add as a child later. How to choose which domain it goes into? Local or
> > fireign? Maybe there is a context you can switch that is in your TLS that
> > tells you which to use (local or foreign table). If we have a push/pop
> > setup it'd be nice, but it's easy to get wrong. An explicit call to crate
> > with foreign and eo_add is local? So eo_foregin_add() uses the foreign
> > domain (if adopted at the time, and if not it will either fail or just use
> > local domain then). Worth thinking about.
> >
> > This buys us REALLY NICE "thread safety" in the way that objects are just
> > not allowed to span threads. They must be explicitly sent over and thus
> > ownership (and EOID value) changes, or you must explicitly do a begin/end
> > on another thread and adopt it's ID table into your local space as a
> > "foregin" table that allows you definitely "read only" access easily, even
> > the ability to modify and delete, but just creation is tricky. This really
> > will clear up lots of mistakes we have been seeing from code that uses
> > threads and does "bad things" that happen to work 99% of the time then fail
> > oddly 1% of the time. We don't have to write "is this my thread" checking
> > code in every method because the design will do that mostly for us as a
> > side-effect of TLS and normal EOID checking. This also buys us simplicity
> > when dealing with objects as we can assume the nice old fashioned way of
> > "no need to lock or consider threads - it will not be an issue", and it
> > buys us a good speedup vs what we have now. It does mean another bit of a
> > re-jigging of eo internals and we need to add some API's to be able to do
> > begin/end etc. and we can later add object sending.
> >
> > I'm rather happy with this kind of direction. We never have to make eo
> > objects thread safe or create a special eo thread safe base class. Ever.
> > You want to talk from thread to thread, then we can have endpoint objects
> > that get created with one object on one end of the msg pipe (like a
> > socketpair() or pipe()), and a different object on the other end (create in
> > one place then send one end to another thread? The object internals hook
> > them up via pipes or threadqueues?). This makes currently "incorrect and
> > dangerous code" fail early instead of 1% of the time. It catches issues
> > fast. It gets us speedups. This potentially impacts promises too, but
> > probably in a good way. This also affects bindings - looking at JS/Lua
> > specificially. If Lua had a threading model right now it'd be to have 1
> > luastate per thread, but this means we can't share objects... this means
> > eo's model is the same and you would have to detach an object from one
> > thread (luastate) and make one appear on the other end. I'm also happy that
> > I think this solves a disagreement on how to do threading. I know we have
> > to somehow and not just stick our heads in the sand. This solves it. It
> > gives a clear optimal model that is relatively robust and efficient.
> >
> > I can see benefits for memory allocation too. Specifically memory directly
> > needed for objects. We can add our own allocators for this instead of using
> > libc and these allocators DO NOT need to be threadsafe thus gaining speed
> > (well catch - if you send an object across we have to somehow deal with
> > this... unknown at the moment - an issue to consider)?
> >
> > The downsides are the extra work, the need to add API's to init eo per
> > thread, need to check for objects on thread shutdown (need the TLS free
> > func to do sanity checking for still-alive objects), the need to add all
> > this foreign vs. local table stuff for begin+end and then to hook that in,
> > and the likely need to add at least object sending, and the messaging
> > infra. Also it will use a bit more memory if you use eo across threads and
> > actually create objects in other threads. As well any mempools and other
> > things that we currently share across threads may need to become per-thread.
> >
> > So ... with that. Questions, comments, queries, devils-advocate. Find issues
> > with this. Wrap your head around it. This is very important as this impacts
> > many things subtly and some directly and clearly.
> 
> Assuming it's what we talked about on IRC, I'm happy with it. It's 
> essentially an implementation what I wanted from the beginning (Eo being 
> single threaded and if you want multi-thread, do something like 
> socketpair()) with the added coolness of enforcing IDs are unique per 
> thread. We'll probably need to change the error messages about ID not 
> found to maybe check the other threads if the ID is valid there so we 
> can say something like: "You tried accessing object X from thread Y, but 
> it's actually an object of thread Z. DENIED! Read more about foobar().".
> 
> My only concern, as always, is performance. I know you said locks are 
> slower than TLS, but while we only needed to lock on write, now we use 
> TLS on both read and write, so it is a concern.
> 
> Hope it pans out well. I'm looking forward to seeing benchmarks (and of 
> course, tests with threads and etc). :)

well it's just as slow as the spinlocks... as long as i have the optional
locking support. literally the if (tdata->shared) or if (domain == shared) then
lock+unlock ... just that alone that if causes it to go from 2.5% to 5%. without
the if we are back to almost pre-locking speed (a bit worse. w vs 2.5%) plus
all the goodies.

so choices:

1. drop the idea of shared objects at all and you HAVE to send an obj
(socketpair like above plus fd passing where ownership transfers).
2. find some way to optimize this... this is not easy. i've tried a few so far.
i'll try some more, but i suspect i'll hit a wall.
3. give up and accept the fate of the lock!

i dislike #3. #1 is the frontrunner BUT i won't go there until #2 is well
beaten into a pulp. i actually had an idea in bed that .... may help. i noticed
the asm output/dump matching the c code seems to have a LOT of the actual if
handling interspersed with the code and maybe just maybe all the if's and their
child code cause a l1 cache miss to l2  since the cachelines are not packed
with instructions we execute, so moving all the error handling to one big error
block at the end might be an idea... another idea i have as to the cost of
THESE if's it the speculative execution. the cpu might be executing BOTH
branches ready to throw away one of them ... but since one branch is a spinlock
this causes a huge cpu stall anyway as it snoops caches with other cpu's etc.
and so everything suppers as if we actually called the lock anyway. if this is
true... UGH. PAAAAIN. scratching head right now as to a solution.

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    ras...@rasterman.com


------------------------------------------------------------------------------
_______________________________________________
enlightenment-devel mailing list
enlightenment-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/enlightenment-devel

Reply via email to