On Sat, Oct 26, 2019 at 01:50, Mark Wielaard <m...@klomp.org> wrote:
Hi,
Sorry this took so long to review. But it is pretty complex code.
I think I got how it works mostly. It is hard to proof correct though.
How did you convince yourself that the code is correct?
No worries, its a complex piece of code.
Srdan designed the hash table portions, so I might have to defer to him
on some points. That said, I'll try to answer what I can.
I think the core parallel logic can be clarified by noting:
- The total order for insertions and lookups is based on the val_ptr
atomic field(s). The hashval field(s) act as a sort of "peek" for
catching collisions before issuing a full CAS.
- Wrt resizing, the table is always in one of five phases:
NO_RESIZING, ALLOCATING_MEMORY, MOVING_DATA (initialization),
MOVING_DATA (insertion), and CLEANING. Four of those five transitions
are handled by a "master" thread chosen by a CAS on the resizing_state
field.
For example I am not sure I can proof that in resize_worker() this
always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
In general the handling of the resizing_state field is pretty tricky.
What is the maximum number of concurrent threads it can handle?
The master thread waits for all workers to deregister before
transitioning from CLEANING to NO_RESIZING. Since the worker thread by
that point has registered (and the synchronization is arranged as such)
it will never see a transition all the way back to NO_RESIZING before
getting its word in. This way a worker thread doesn't "miss the boat"
and end up in an later resizing cycle.
The maximum number of threads is a size_t minus 2 bits, so bare minimum
is 16384 on a 16-bit machine (which standard C still technically
supports). Any modern system will allow either a billion (30 bits) or 4
quintillion (62 bits).
I tried to simplify the code a little. You already observed that
COMPARE can be zero. But it always is. We never try comparing values.
So all the COMPARE and value passing to the find functions can simply
be removed. So if you agree I would like to merge the attached
simplification diff into this patch.
I'm fine with it, although at a glance it seems that this means
insert_helper will never return -1. Which doesn't quite sound accurate,
so I'll have to defer to Srdan on this one.
The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't use
COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one
with the concurrent version, but I think we should. It is only used
when there are debug type units in the Dwarf. Which is not the default
(and some gcc hackers really don't like them) so you will probably not
encounter them normally. But we should probably make looking them up
also concurrent safe as a followup patch.
A quick grep -r shows that ITERATE and REVERSE are used for the
asm_symbol_tab. If iteration and insertion are not concurrent we can
easily add bidirectional iteration (using the same Treiber stack-like
structure as used for the memory management). Also COMPARE is not
defined to be (0) in this instance.
I only tested the single-threaded case. It is slightly slower, but not
very much. The previous patch made the code slightly faster, that
slight speed increase is gone now. But that also means it is about on
par with the old version without any concurrent improvements.
It does use a bit more memory though. For each CU the abbrev table
structure is 112 bytes larger. Given that there can be several 1000
CUs
in a (large) Dwarf that means a couple of hundred K of extra memory
use. It is probably acceptable since such files contain 100 MBs of DIE
data.
That extra memory comes from the state for resizing mostly, so it
probably could be reclaimed. Maybe it could live on the master thread's
stack and then be accessed via pointer by the worker threads? If its
enough of an issue to consider fixing I can work out a simple patch for
it.
But I was wondering whether next_init_block and num_initialized_blocks
could be shared with next_move_block and num_moved_blocks. If I
understand the code in resize_helper() correctly then all threads are
either initializing or all threads are moving. So can't we just use
one
next_block and one num_blocks field?
The main issue is that either a) somehow the fields have to be reset
between phases (which means something like a double barrier, and/or a
"chosen" thread), or b) the second phase has to go in reverse (which
has the complication of juggling the difference in ending conditions,
and dealing with the phase transition properly).
Not impossible, just probably too complex to be worth the 16 byte gain.
I do think the code itself is good. The above are mostly just
nitpicks.
But if you could reply and give your thoughts on them that would be
appreciated.
Thanks,
Mark