On Fri, 2019-10-25 at 23:11 -0500, Jonathon Anderson wrote: > On Sat, Oct 26, 2019 at 01:50, Mark Wielaard <m...@klomp.org> wrote: > > For example I am not sure I can proof that in resize_worker() this > > always holds: assert(GET_STATE(resize_state) != NO_RESIZING); > > In general the handling of the resizing_state field is pretty tricky. > > What is the maximum number of concurrent threads it can handle? > > The master thread waits for all workers to deregister before > transitioning from CLEANING to NO_RESIZING. Since the worker thread by > that point has registered (and the synchronization is arranged as such) > it will never see a transition all the way back to NO_RESIZING before > getting its word in. This way a worker thread doesn't "miss the boat" > and end up in an later resizing cycle.
Thanks. I think my confusion came from the fact that these states are expressed by 2 STATE_BITs. I didn't realize they are distinct states and cannot hold simultaneously. I somehow had the impression all states could be "active" at the same time. The bit-fiddling with atomic_fetch_xor_explicit is tricky. So the busy loop to check for GET_ACTIVE_WORKERS in resize_master works because the master itself never does an STATE_INCREMENT itself. Are these busy loops not expensive? There are a couple in the code where are thread is just spinning busily till all other threads are ready and/or the state is changed. > The maximum number of threads is a size_t minus 2 bits, so bare minimum > is 16384 on a 16-bit machine (which standard C still technically > supports). Any modern system will allow either a billion (30 bits) or 4 > quintillion (62 bits). Ah, right, that was also connected to my confusion about how many bits were used to hold the state. 1 billion threads should be enough for anybody :) > > The other dynamic size hash is Dwarf_Sig8_Hash. This also doesn't > > use > > COMPARE (nor ITERATE and REVERSE). I haven't tried to replace that one > > with the concurrent version, but I think we should. It is only used > > when there are debug type units in the Dwarf. Which is not the default > > (and some gcc hackers really don't like them) so you will probably not > > encounter them normally. But we should probably make looking them up > > also concurrent safe as a followup patch. > > A quick grep -r shows that ITERATE and REVERSE are used for the > asm_symbol_tab. If iteration and insertion are not concurrent we can > easily add bidirectional iteration (using the same Treiber stack-like > structure as used for the memory management). Also COMPARE is not > defined to be (0) in this instance. Yes. We would need some other solution for the libasm code. But I think we can use the concurrent table for everything in libdw. > > I only tested the single-threaded case. It is slightly slower, but not > > very much. The previous patch made the code slightly faster, that > > slight speed increase is gone now. But that also means it is about on > > par with the old version without any concurrent improvements. > > > > It does use a bit more memory though. For each CU the abbrev table > > structure is 112 bytes larger. Given that there can be several 1000 > > CUs > > in a (large) Dwarf that means a couple of hundred K of extra memory > > use. It is probably acceptable since such files contain 100 MBs of DIE > > data. > > That extra memory comes from the state for resizing mostly, so it > probably could be reclaimed. Maybe it could live on the master thread's > stack and then be accessed via pointer by the worker threads? If its > enough of an issue to consider fixing I can work out a simple patch for > it. I am not too concerned by it. But haven't done much measurements. You probably have one of the most demanding programs using this code. Are you seeing a significant use in memory? > > But I was wondering whether next_init_block and num_initialized_blocks > > could be shared with next_move_block and num_moved_blocks. If I > > understand the code in resize_helper() correctly then all threads are > > either initializing or all threads are moving. So can't we just use > > one > > next_block and one num_blocks field? > > The main issue is that either a) somehow the fields have to be reset > between phases (which means something like a double barrier, and/or a > "chosen" thread), or b) the second phase has to go in reverse (which > has the complication of juggling the difference in ending conditions, > and dealing with the phase transition properly). > > Not impossible, just probably too complex to be worth the 16 byte gain. Yes. Now that I have a better handle on the state changes I see that this is trickier than I thought. It looks like just 16 bytes, but this is per CU. Given that there can be a thousand CUs that does add up. But probably not worth it at this point (see also above about memory usage measurements). Thanks, Mark