On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote:
> On Sat, Oct 26, 2019 at 01:50, Mark Wielaard <m...@klomp.org> wrote:
> > For example I am not sure I can prove that in resize_worker() this
> > always holds: assert(GET_STATE(resize_state) != NO_RESIZING);
> > In general the handling of the resizing_state field is pretty tricky.
> > What is the maximum number of concurrent threads it can handle?
> 
> The master thread waits for all workers to deregister before 
> transitioning from CLEANING to NO_RESIZING. Since the worker thread by 
> that point has registered (and the synchronization is arranged as such) 
> it will never see a transition all the way back to NO_RESIZING before 
> getting its word in. This way a worker thread doesn't "miss the boat" 
> and end up in a later resizing cycle.

Thanks. I think my confusion came from the fact that these states are
encoded in 2 STATE_BITs. I didn't realize that they are distinct states
that cannot hold simultaneously. I somehow had the impression all
states could be "active" at the same time. The bit-fiddling with
atomic_fetch_xor_explicit is tricky.
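To illustrate why the xor trick is subtle, here is a minimal sketch of
a state transition done with atomic_fetch_xor_explicit. The enum
values and the `transition` helper are my reconstruction from this
thread, not the actual patch; the point is that xor-ing `from ^ to`
only lands on `to` if the field really held `from`:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative values only; the real names/values may differ.  */
enum resize_state { NO_RESIZING = 0, ALLOCATING_MEMORY = 1,
                    MOVING_DATA = 2, CLEANING = 3 };

static _Atomic size_t resizing_state = ALLOCATING_MEMORY;

/* Move from `from` to `to` in one atomic op.  This is only correct
   when the state bits currently hold `from`: applying the same xor
   mask from any other state lands somewhere unintended (e.g.
   NO_RESIZING ^ (ALLOCATING_MEMORY ^ MOVING_DATA) == CLEANING).  */
static void
transition (enum resize_state from, enum resize_state to)
{
  atomic_fetch_xor_explicit (&resizing_state, from ^ to,
                             memory_order_acq_rel);
}
```

So each xor encodes a specific edge of the state machine, which is why
the states must be mutually exclusive for the scheme to work.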

So the busy loop checking GET_ACTIVE_WORKERS in resize_master works
because the master itself never does a STATE_INCREMENT.
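For concreteness, here is a rough sketch of the field layout and the
register/deregister handshake as I understand it from this thread: the
low 2 bits hold the state, the remaining bits count registered
workers. The macro names mirror the ones mentioned above, but the
details are my reconstruction, not the actual patch:

```c
#include <stdatomic.h>
#include <stddef.h>

#define STATE_BITS 2
#define STATE_INCREMENT ((size_t) 1 << STATE_BITS)   /* one worker */
#define GET_STATE(v) ((v) & (STATE_INCREMENT - 1))
#define GET_ACTIVE_WORKERS(v) ((v) >> STATE_BITS)

enum { NO_RESIZING = 0, ALLOCATING_MEMORY, MOVING_DATA, CLEANING };

static _Atomic size_t resizing_state;

/* Workers bump the counter before helping and drop it when done.  */
static void
worker_register (void)
{
  atomic_fetch_add_explicit (&resizing_state, STATE_INCREMENT,
                             memory_order_acquire);
}

static void
worker_deregister (void)
{
  atomic_fetch_sub_explicit (&resizing_state, STATE_INCREMENT,
                             memory_order_release);
}

/* The master can spin on the worker count because only workers ever
   add STATE_INCREMENT; the master itself never does.  Once the count
   hits zero it cannot rise again within this resizing cycle.  */
static void
master_wait_for_workers (void)
{
  while (GET_ACTIVE_WORKERS (atomic_load_explicit (&resizing_state,
                                                   memory_order_acquire)))
    ; /* busy wait */
}
```

Since a registered worker holds the count above zero, the master
cannot complete the CLEANING -> NO_RESIZING transition underneath it,
which is the invariant discussed above.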

Aren't these busy loops expensive? There are a couple of places in the
code where a thread just spins busily until all other threads are
ready and/or the state has changed.
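One common mitigation, if the spinning turns out to matter, is to
bound the spin and then yield the CPU. This is a generic sketch (the
helper name is hypothetical, not from the patch), assuming POSIX
sched_yield is available:

```c
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Spin a bounded number of iterations on an atomic counter, then
   yield so other threads (possibly the ones we are waiting for) can
   make progress.  Purely illustrative.  */
static void
spin_until_zero (_Atomic size_t *counter)
{
  unsigned spins = 0;
  while (atomic_load_explicit (counter, memory_order_acquire) != 0)
    if (++spins > 1000)
      {
        sched_yield ();  /* back off instead of burning the core */
        spins = 0;
      }
}
```

Whether that is worth it depends on how long the wait phases actually
last in practice.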

> The maximum number of threads is a size_t minus 2 bits, so bare minimum 
> is 16384 on a 16-bit machine (which standard C still technically 
> supports). Any modern system will allow either a billion (30 bits) or 4 
> quintillion (62 bits).

Ah, right, that was also connected to my confusion about how many bits
were used to hold the state. 1 billion threads should be enough for
anybody :)

> > The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't
> > use
> > COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one
> > with the concurrent version, but I think we should. It is only used
> > when there are debug type units in the Dwarf. Which is not the default
> > (and some gcc hackers really don't like them) so you will probably not
> > encounter them normally. But we should probably make looking them up
> > also concurrent safe as a followup patch.
> 
> A quick grep -r shows that ITERATE and REVERSE are used for the 
> asm_symbol_tab. If iteration and insertion are not concurrent we can 
> easily add bidirectional iteration (using the same Treiber stack-like 
> structure as used for the memory management). Also COMPARE is not 
> defined to be (0) in this instance.

Yes. We would need some other solution for the libasm code. But I think
we can use the concurrent table for everything in libdw.
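For context, the Treiber stack-like structure mentioned above for the
memory management is, in its generic form, roughly the following. This
is a textbook illustration, not the code from the patch (and it
ignores the ABA problem, which the patch reportedly sidesteps by only
reclaiming nodes at destruction time):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node
{
  struct node *next;
  /* payload ...  */
};

static _Atomic (struct node *) top;

/* Lock-free push: link the new node to the current top, then CAS it
   in; retry if another thread moved the top underneath us.  */
static void
push (struct node *n)
{
  struct node *old = atomic_load_explicit (&top, memory_order_relaxed);
  do
    n->next = old;
  while (!atomic_compare_exchange_weak_explicit (&top, &old, n,
                                                 memory_order_release,
                                                 memory_order_relaxed));
}

/* Lock-free pop: CAS the top to its successor.  */
static struct node *
pop (void)
{
  struct node *old = atomic_load_explicit (&top, memory_order_acquire);
  while (old != NULL
         && !atomic_compare_exchange_weak_explicit (&top, &old, old->next,
                                                    memory_order_acquire,
                                                    memory_order_relaxed))
    ;
  return old;
}
```

Walking the `next` links gives one iteration direction essentially for
free, which is presumably what the suggested bidirectional iteration
would build on.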

> > I only tested the single-threaded case. It is slightly slower, but not
> > very much. The previous patch made the code slightly faster, that
> > slight speed increase is gone now. But that also means it is about on
> > par with the old version without any concurrent improvements.
> > 
> > It does use a bit more memory though. For each CU the abbrev table
> > structure is 112 bytes larger. Given that there can be several 1000 
> > CUs
> > in a (large) Dwarf that means a couple of hundred K of extra memory
> > use. It is probably acceptable since such files contain 100 MBs of DIE
> > data.
> 
> That extra memory comes from the state for resizing mostly, so it 
> probably could be reclaimed. Maybe it could live on the master thread's 
> stack and then be accessed via pointer by the worker threads? If it's 
> enough of an issue to consider fixing I can work out a simple patch for 
> it.

I am not too concerned by it. But I haven't done many measurements.
You probably have one of the most demanding programs using this code.
Are you seeing significant memory use?

> > But I was wondering whether next_init_block and num_initialized_blocks
> > could be shared with next_move_block and num_moved_blocks. If I
> > understand the code in resize_helper() correctly then all threads are
> > either initializing or all threads are moving. So can't we just use 
> > one
> > next_block and one num_blocks field?
> 
> The main issue is that either a) somehow the fields have to be reset 
> between phases (which means something like a double barrier, and/or a 
> "chosen" thread), or b) the second phase has to go in reverse (which 
> has the complication of juggling the difference in ending conditions, 
> and dealing with the phase transition properly).
> 
> Not impossible, just probably too complex to be worth the 16 byte gain.

Yes. Now that I have a better handle on the state changes I see that
this is trickier than I thought. It looks like just 16 bytes, but this
is per CU. Given that there can be a thousand CUs that does add up. But
probably not worth it at this point (see also above about memory usage
measurements).

Thanks,

Mark
