Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.

Jonathon Anderson Fri, 25 Oct 2019 21:12:53 -0700



On Sat, Oct 26, 2019 at 01:50, Mark Wielaard <m...@klomp.org> wrote:

Hi,

Sorry this took so long to review. But it is pretty complex code.
I think I got how it works mostly. It is hard to proof correct though.
How did you convince yourself that the code is correct?


No worries, its a complex piece of code.

Srdan designed the hash table portions, so I might have to defer to himon some points. That said, I'll try to answer what I can.


I think the core parallel logic can be clarified by noting:

- The total order for insertions and lookups is based on the val_ptratomic field(s). The hashval field(s) act as a sort of "peek" forcatching collisions before issuing a full CAS.- Wrt resizing, the table is always in one of five phases:NO_RESIZING, ALLOCATING_MEMORY, MOVING_DATA (initialization),MOVING_DATA (insertion), and CLEANING. Four of those five transitionsare handled by a "master" thread chosen by a CAS on the resizing_statefield.


For example I am not sure I can proof that in resize_worker() this
always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
In general the handling of the resizing_state field is pretty tricky.
What is the maximum number of concurrent threads it can handle?

The master thread waits for all workers to deregister beforetransitioning from CLEANING to NO_RESIZING. Since the worker thread bythat point has registered (and the synchronization is arranged as such)it will never see a transition all the way back to NO_RESIZING beforegetting its word in. This way a worker thread doesn't "miss the boat"and end up in an later resizing cycle.

The maximum number of threads is a size_t minus 2 bits, so bare minimumis 16384 on a 16-bit machine (which standard C still technicallysupports). Any modern system will allow either a billion (30 bits) or 4quintillion (62 bits).


I tried to simplify the code a little. You already observed that
COMPARE can be zero. But it always is. We never try comparing values.
So all the COMPARE and value passing to the find functions can simply
be removed. So if you agree I would like to merge the attached
simplification diff into this patch.

I'm fine with it, although at a glance it seems that this meansinsert_helper will never return -1. Which doesn't quite sound accurate,so I'll have to defer to Srdan on this one.


The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't use
COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one
with the concurrent version, but I think we should. It is only used
when there are debug type units in the Dwarf. Which is not the default
(and some gcc hackers really don't like them) so you will probably not
encounter them normally. But we should probably make looking them up
also concurrent safe as a followup patch.

A quick grep -r shows that ITERATE and REVERSE are used for theasm_symbol_tab. If iteration and insertion are not concurrent we caneasily add bidirectional iteration (using the same Treiber stack-likestructure as used for the memory management). Also COMPARE is notdefined to be (0) in this instance.


I only tested the single-threaded case. It is slightly slower, but not
very much. The previous patch made the code slightly faster, that
slight speed increase is gone now. But that also means it is about on
par with the old version without any concurrent improvements.

It does use a bit more memory though. For each CU the abbrev table

structure is 112 bytes larger. Given that there can be several 1000CUs

in a (large) Dwarf that means a couple of hundred K of extra memory
use. It is probably acceptable since such files contain 100 MBs of DIE
data.

That extra memory comes from the state for resizing mostly, so itprobably could be reclaimed. Maybe it could live on the master thread'sstack and then be accessed via pointer by the worker threads? If itsenough of an issue to consider fixing I can work out a simple patch forit.


But I was wondering whether next_init_block and num_initialized_blocks
could be shared with next_move_block and num_moved_blocks. If I
understand the code in resize_helper() correctly then all threads are

either initializing or all threads are moving. So can't we just useone

next_block and one num_blocks field?

The main issue is that either a) somehow the fields have to be resetbetween phases (which means something like a double barrier, and/or a"chosen" thread), or b) the second phase has to go in reverse (whichhas the complication of juggling the difference in ending conditions,and dealing with the phase transition properly).


Not impossible, just probably too complex to be worth the 16 byte gain.

I do think the code itself is good. The above are mostly justnitpicks.
But if you could reply and give your thoughts on them that would be
appreciated.

Thanks,

Mark

Re: [PATCH 3/3] lib + libdw: Add and use a concurrent version of the dynamic-size hash table.

Reply via email to