Hello everyone, gently pinging to bring this back to life given the last patch I emailed.
Best regards, -- Nuno Diegues On Mon, Aug 24, 2015 at 12:51 PM, Nuno Diegues <n...@ist.utl.pt> wrote: > Hello everyone, > > after a summer internship and some backlog catching up in the past > weeks, I have finally got around to review the patch according to the > latest discussion. > > The changes have not caused regression, and the latest speedup results > are coherent with what we had before. > > > In the following I add some comments inline to the discussion, and > after that paste the updated patch. > > >> > >> > So let's iterate on this in parallel with the other changes that we need >> > to get in place. I'd prefer to have some more confidence that measuring >> > txn throughput in epochs is the best way forward. >> > >> > Here are a few thoughts: >> > >> > Why do we actually need to measure succeeded transactions? If a HW txn >> > is just getting committed without interfering with anything else, is >> > this really different from, say, two HW txns getting committed? Aren't >> > we really just interested in wasted work, or delayed txns? That may >> > help taking care of the nontxnal vs. txnal problem. >> > >> > Measuring committed txns during a time that might otherwise be spent by >> > a serial txns could be useful to figure out how much other useful work a >> > serial txn prevents. But we'd see that as well if we'd just go serial >> > during the auto-tuning because then concurrent txns will have to wait; >> > and in this case we could measure it in the slow path of those >> > concurrent txns (ie, during waiting, where measurement overhead wouldn't >> > matter as much). >> > >> > If a serial txn is running, concurrent txns (that wait) are free to sync >> > and tell the serial how much cost it creates for the concurrent txns. >> > There, txn length could matter, but we won't find out for real until >> > after the concurrent ones have run (they could be pretty short, so we >> > can't simply assume that the concurrent ones are as long as the serial >> > one, so that simply the number of concurrent ones can be used to >> > calculate delayed work). > > > I understand you concern: measuring failure in a slow path rather than > success in > the fast/common path. > > My problem is that the approach taken in our work is tailored for accessing > one > given metric, in our case some form of throughput of successfully > executed atomic > blocks. > However, to measure failure, we seem to need two things: > 1) the rate of failed hardware transactions > 2) the time spent waiting for the serial lock > > In short, the performance will be bad if there is a mix of those two > issues that is bad > enough. The problem is how to define a metric that considers those two > together? > It is not obvious to me that there is one. Furthermore, that deviates > significantly from > our approach, and in the end --- even if there is a solution that > works in that way --- > that is a different approach, not one that can be applied to our > self-tuning framework. > > What we are doing right now is incrementing a counter in a (most > likely cached line) > that is single-writer and multiple-reader. Most often, it will be > owned by the single-writer. > As such, the cost should be quite negligible compared to the execution > of the atomic > block, more so in the case that a Hardware transaction was used, as > the commit itself > should be a hundred cycles or more. > > A good way to measure this impact is to use the new "htm_notune" flag > and compare > its performance with the non-changed libitm that I use as a baseline above. > The average results are similar, without any trend noticeable outside > the standard > deviation for one side or the other. > > > > >> >> > >> >> >> >> > Also, note that the mitigation strategy for rdtsc >> >> > short-comings that you mention in the paper is not applicable in >> >> > general, specifically not in libitm. >> >> >> >> >> >> I suppose you mean the preemption of threads inflating the cycles >> >> measured? >> > >> > Yes, preemption and migration of threads (in case there's no global >> > sync'ed TSC or similar) -- you mentioned in the paper that you pin >> > threads to cores... >> > >> >> This would be similarly a problem to any time source that tries to >> >> measure the >> >> amount of work performed; not sure how we can avoid it in general. Any >> >> thoughts? >> > >> > Not really as long as we keep depending on measuring time in a >> > light-weight way. Measuring smaller time intervals could make it less >> > likely that preemption happens during such an interval, though. >> > > > > Note that in this patch we use no such pinning. In fact, I've measured > that it does not > have a noticeable impact in performance. Maybe that could be the case with > heavy > over-subscription of the cores with lots of threads. > Still, it is not a correctness issue, so I believe that the fact this > performs great in the > common case should make it prevail. > > > >> >> >> >> >> > Another issue is that we need to implement the tuning in a portable way. >> >> > You currently make it depend on whether the assembler knows about RTM, >> >> > whereas the rest of the code makes this more portable and defers to >> >> > arch-specific files. I'd prefer if we could use the tuning on other >> >> > archs too. But for that, we need to cleanly separate generic from >> >> > arch-specific parts. That affects magic numbers as well as things like >> >> > rdtsc(). >> >> >> >> >> >> Yes, I refrained from adding new calls to the arch-specific files, to >> >> contain the >> >> changes mainly. But that is possible and that's part of the feedback I >> >> was hoping >> >> to get. >> > >> > OK. Let me know if you want further input regarding this. > > > I have established this arch-specific part in the new patch. PTAL > > > >> >> > >> >> > I'm wondering about whether it really makes sense to treat XABORT like >> >> > conflicts and other abort reasons, instead of like capacity aborts. >> >> > Perhaps we need to differentiate between the explicit abort codes glibc >> >> > uses, so that we can distinguish between cases where an abort is >> >> > supposed to signal incompatibility with txnal execution and cases where >> >> > it's just used for lock elision (see sysdeps/unix/sysv/linux/x86/hle.h >> >> > in current glibc): >> >> > #define _ABORT_LOCK_BUSY 0xff >> >> > #define _ABORT_LOCK_IS_LOCKED 0xfe >> >> > #define _ABORT_NESTED_TRYLOCK 0xfd >> >> >> >> >> >> I am not sure I follow: are you suggesting to consider other aborts >> >> besides >> >> those of capacity? >> > >> > We may want to do this with UCB, similar to capacity. Another option >> > would be to, for now, make fixed choices regarding whether they are >> > considered permanent or not. >> > _ABORT_NESTED_TRYLOCK is an incompatibility with txnal execution >> > (similar to illegal instructions and such). >> > _ABORT_LOCK_BUSY is failing lock elision due to the lock being already >> > acquired; thus, this is transient I'd say. >> > >> > Andi, what was _ABORT_LOCK_IS_LOCKED used for again? > > > This would seem a more or less trivial change, after it is agreed > upon. So I suggest > we leave it for later if/when this one gets through. > > > > >> >> >> > > --- libitm/config/x86/target.h (revision 219316) >> >> > > +++ libitm/config/x86/target.h (working copy) >> >> > > @@ -126,12 +126,25 @@ htm_abort_should_retry (uint32_t begin_ret) >> >> > > return begin_ret & _XABORT_RETRY; >> >> > > } >> >> > > >> >> > > +static inline bool >> >> > > +htm_is_capacity_abort(uint32_t begin_ret) >> >> > > +{ >> >> > > + return begin_ret & _XABORT_CAPACITY; >> >> > > +} >> >> > >> >> > Is this indeed just about capacity, or do we actually mean a more >> >> > general situation? >> >> >> >> The idea of the Upper Confidence Bounds (UCB) part of the algorithm was >> >> to detect "real" capacity aborts (i.e., permanent) vs those that are >> >> "transient". >> >> So in this case we really wanted to know if the abort was a capacity one >> >> so >> >> that we could direct the UCB algorithm and understand the "type" of >> >> capacity >> >> abort. >> > >> > OK. But do you want to single out capacity aborts as one case for which >> > you want to detect permanent vs. transient, or do you want to generally >> > find out whether aborts that are reported as permanent (e.g., capacity) >> > are indeed permanent? In the latter case, a more generic name would be >> > more appropriate. (And that's not just a naming thing, but guidance for >> > other archs regarding what they should put in the "capacity" bucket of >> > abort reasons...) >> > >> > > > > As far as we know, capacity aborts are the ones that fit our criteria: > they are identifiable > in a different abort category, yet they are not necessarily > persistent/deterministic. > This could be expanded to others in the future, but it is not > something we tested with, as > other abort types are not as widespread. > > > > > > Updated patch follows: > > Index: util.cc > =================================================================== > --- util.cc (revision 219316) > +++ util.cc (working copy) > @@ -94,4 +94,15 @@ xrealloc (void *old, size_t size, bool separate_cl > return r; > } > > +// State for the generation is not synchronized. It assumes single thread > +// usage at any time. It is, in fact, invoked by the thread that is > +// re-optimizing the libitm HTM usage, so the assumption holds. > +uint32_t > +random_int(uint32_t max) > +{ > + static uint32_t seed = 123456789; > + seed = (1103515245 * seed + 12345); > + return seed; > +} > + > } // namespace GTM > Index: libitm.h > =================================================================== > --- libitm.h (revision 219316) > +++ libitm.h (working copy) > @@ -101,7 +101,11 @@ typedef enum > /* These are not part of the ABI but used for custom HTM fast paths. See > ITM_beginTransaction and gtm_thread::begin_transaction. */ > pr_HTMRetryableAbort = 0x800000, > - pr_HTMRetriedAfterAbort = 0x1000000 > + pr_HTMRetriedAfterAbort = 0x1000000, > + /* Reports whether a capacity abort was detected in the custom HTM fast > + path, so that it can be used in the abort handler in > + gtm_thread::begin_transaction. */ > + pr_HTMCapacityAbort = 0x2000000 > } _ITM_codeProperties; > > /* Result from startTransaction that describes what actions to take. > Index: method-serial.cc > =================================================================== > --- method-serial.cc (revision 219316) > +++ method-serial.cc (working copy) > @@ -223,6 +223,19 @@ struct htm_mg : public method_group > // initially disabled. > #ifdef USE_HTM_FASTPATH > htm_fastpath = htm_init(); > + optimizer.optimized_mode = STUBBORN; > + optimizer.last_cycles = get_efficient_time(); > + optimizer.last_total_txs_executed = 0; > + optimizer.last_throughput = optimizer.MIN_THROUGHPUT; > + optimizer.last_attempts = htm_fastpath > 0 ? htm_fastpath - 1 : 1; > + optimizer.best_ever_throughput = optimizer.last_throughput; > + optimizer.best_ever_attempts = htm_fastpath; > + optimizer.txns_while_stubborn = 1; > + optimizer.cycles_while_stubborn = optimizer.INIT_CYCLES; > + optimizer.txns_while_halven = 1; > + optimizer.cycles_while_halven = optimizer.INIT_CYCLES; > + optimizer.txns_while_giveup = 1; > + optimizer.cycles_while_giveup = optimizer.INIT_CYCLES; > #endif > } > virtual void fini() > Index: retry.cc > =================================================================== > --- retry.cc (revision 219316) > +++ retry.cc (working copy) > @@ -218,7 +218,6 @@ GTM::gtm_thread::set_default_dispatch(GTM::abi_dis > default_dispatch.store(disp, memory_order_relaxed); > } > > - > static GTM::abi_dispatch* > parse_default_method() > { > @@ -254,10 +253,17 @@ parse_default_method() > disp = GTM::dispatch_ml_wt(); > env += 5; > } > + else if (strncmp(env, "htm_notune", 10) == 0) > + { > + disp = GTM::dispatch_htm(); > + env += 10; > + GTM::optimizer.set_is_enabled(0); > + } > else if (strncmp(env, "htm", 3) == 0) > { > disp = GTM::dispatch_htm(); > env += 3; > + GTM::optimizer.set_is_enabled(1); > } > else > goto unknown; > Index: beginend.cc > =================================================================== > --- beginend.cc (revision 219316) > +++ beginend.cc (working copy) > @@ -39,6 +39,9 @@ unsigned GTM::gtm_thread::number_of_threads = 0; > gtm_stmlock GTM::gtm_stmlock_array[LOCK_ARRAY_SIZE]; > atomic<gtm_version> GTM::gtm_clock; > > +// Holds the optimized configuration for parameters relevant to HTM > execution. > +gtm_global_optimizer GTM::optimizer; > + > /* ??? Move elsewhere when we figure out library initialization. */ > uint64_t GTM::gtm_spin_count_var = 1000; > > @@ -95,7 +98,139 @@ thread_exit_init() > GTM_fatal("Creating thread release TLS key failed."); > } > > +/* Implementations for the fixed-point arithmetic type. */ > +const uint32_t GTM::fixed_pt_t::FIXED_PT_WIDTH = 64; > +const uint32_t GTM::fixed_pt_t::FIXED_PT_INTEGER_WIDTH = 40; > +const uint32_t GTM::fixed_pt_t::FIXED_PT_FRAC_WIDTH = > + GTM::fixed_pt_t::FIXED_PT_WIDTH - GTM::fixed_pt_t::FIXED_PT_INTEGER_WIDTH; > +const fixed_pt_t GTM::fixed_pt_t::FIXED_PT_ONE = GTM::fixed_pt_t::one(); > +const fixed_pt_t GTM::fixed_pt_t::FIXED_PT_TWO = > + GTM::fixed_pt_t::FIXED_PT_ONE.add(GTM::fixed_pt_t::FIXED_PT_ONE); > > +fixed_pt_t > +GTM::fixed_pt_t::one() > +{ > + fixed_pt_t result(0); > + result.number = static_cast<uint64_t>(1) << FIXED_PT_FRAC_WIDTH; > + return result; > +} > + > +fixed_pt_t > +GTM::fixed_pt_t::add(const fixed_pt_t operand) const > +{ > + fixed_pt_t result(0); > + result.number = this->number + operand.number; > + return result; > +} > + > +fixed_pt_t > +GTM::fixed_pt_t::sub(const fixed_pt_t operand) const > +{ > + fixed_pt_t result(0); > + result.number = this->number - operand.number; > + return result; > +} > + > +fixed_pt_t > +GTM::fixed_pt_t::mul(const fixed_pt_t operand) const > +{ > + fixed_pt_t result(0); > + result.number = (static_cast<__uint128_t>(this->number) * > + static_cast<__uint128_t>(operand.number)) >> FIXED_PT_FRAC_WIDTH; > + return result; > +} > + > +fixed_pt_t > +GTM::fixed_pt_t::div(const fixed_pt_t operand) const > +{ > + fixed_pt_t result(0); > + result.number = > + (static_cast<__uint128_t>(this->number) << FIXED_PT_FRAC_WIDTH) / > + static_cast<__uint128_t>(operand.number); > + return result; > +} > + > +fixed_pt_t > +GTM::fixed_pt_t::sqrt() const > +{ > + uint64_t invert = 0; > + uint64_t iter = FIXED_PT_FRAC_WIDTH; > + uint64_t i; > + fixed_pt_t n = *this; > + > + if (n.number == 0 || n.number == FIXED_PT_ONE.number) > + { > + return n; > + } > + if (n.number < FIXED_PT_ONE.number && n.number > 6) > + { > + invert = 1; > + n = FIXED_PT_ONE.div(n); > + } > + if (n.number > FIXED_PT_ONE.number) > + { > + uint64_t s = n.number; > + iter = 0; > + while (s > 0) > + { > + s >>= 2; > + iter++; > + } > + } > + > + fixed_pt_t l(0); > + l.number = (n.number >> 1) + 1; > + for (i = 0; i < iter; i++) > + { > + l.number = l.add(n.div(l)).number >> 1; > + } > + if (invert) > + { > + return FIXED_PT_ONE.div(l); > + } > + return l; > +} > + > +fixed_pt_t > +GTM::fixed_pt_t::ln() const > +{ > + const fixed_pt_t LN2 = fixed_pt_t(0.69314718055994530942); > + const fixed_pt_t LG[7] = { > + fixed_pt_t(6.666666666666735130e-01), > + fixed_pt_t(3.999999999940941908e-01), > + fixed_pt_t(2.857142874366239149e-01), > + fixed_pt_t(2.222219843214978396e-01), > + fixed_pt_t(1.818357216161805012e-01), > + fixed_pt_t(1.531383769920937332e-01), > + fixed_pt_t(1.479819860511658591e-01) > + }; > + > + if (this->number == 0) > + { > + return 0xffffffff; > + } > + > + fixed_pt_t log2 = fixed_pt_t(0); > + fixed_pt_t xi = *this; > + while (xi.number > FIXED_PT_TWO.number) > + { > + xi.number >>= 1; > + log2.number++; > + } > + > + fixed_pt_t f = xi.sub(FIXED_PT_ONE); > + fixed_pt_t s = f.div(FIXED_PT_TWO.add(f)); > + fixed_pt_t z = s.mul(s); > + fixed_pt_t w = z.mul(z); > + fixed_pt_t R = > + w.mul(LG[1].add(w.mul(LG[3].add(w.mul(LG[5]))))).add( > + z.mul(LG[0].add(w.mul(LG[2].add( > + w.mul(LG[4].add(w.mul(LG[6])))))))); > + > + return LN2.mul(fixed_pt_t(log2.number << FIXED_PT_FRAC_WIDTH)).add( > + f.sub(s.mul(f.sub(R)))); > +} > + > GTM::gtm_thread::~gtm_thread() > { > if (nesting > 0) > @@ -149,6 +284,180 @@ choose_code_path(uint32_t prop, abi_dispatch *disp > return a_runInstrumentedCode; > } > > +// Periodically invoked by some thread(s) that are in the HTM fall-back path. > +// This implements the Upper Confidence Bounds and Gradient Descent to tune, > +// respectively, how to deal with HTM capacity aborts and how many HTM > retries > +// to allow. For more details, check Diegues et al. in USENIX ICAC 2014. > +void > +GTM::gtm_global_optimizer::reoptimize_htm_execution() > +{ > + // Avoid redundant re-optimization if another thread may be doing it > already. > + // This allows the threads to avoid racing to re-optimize concurrently, > which > + // is useless. > + > + if (unlikely(GTM::gtm_thread::serial_lock.is_write_locked())) > + { > + return; > + } > + // Eventually, some thread(s) will follow this path, even in the face of > + // spurious short-cut exits due to the serial lock being taken. > + GTM::gtm_thread::serial_lock.write_lock(); > + > + // Collect the statistics obtained so far. > + uint64_t total_txs_executed = 0; > + gtm_thread *ptr = GTM::gtm_thread::list_of_threads; > + for (; ptr; ptr = ptr->next_thread) > + { > + total_txs_executed += ptr->number_executed_txns; > + } > + > + // Obtain the delta performance with respect to the last period measured. > + // We expect the underlying architecture to provide an efficient way to > + // measure time passed since the last re-optimization. > + const uint64_t current_cycles = get_efficient_time(); > + const uint64_t cycles_used = current_cycles - optimizer.last_cycles; > + const uint64_t new_txs_executed = > + total_txs_executed - optimizer.last_total_txs_executed; > + const fixed_pt_t current_throughput = > + fixed_pt_t(new_txs_executed).div(fixed_pt_t(cycles_used)); > + > + if (current_throughput.number == 0) > + { > + // This thread probably executed right after another one as they both > + // found it was time to re-optimize, but did not contend on the serial > + // lock. As such, we avoid redundant re-optimizations and short cut > out. > + GTM::gtm_thread::serial_lock.write_unlock(); > + return; > + } > + > + optimizer.last_cycles = current_cycles; > + optimizer.last_total_txs_executed = total_txs_executed; > + if (optimizer.optimized_mode == STUBBORN) > + { > + optimizer.txns_while_stubborn += new_txs_executed; > + optimizer.cycles_while_stubborn += cycles_used; > + } > + else if (optimizer.optimized_mode == HALVEN) > + { > + optimizer.txns_while_halven += new_txs_executed; > + optimizer.cycles_while_halven += cycles_used; > + } > + else > + { > + optimizer.txns_while_giveup += new_txs_executed; > + optimizer.cycles_while_giveup += cycles_used; > + } > + > + // Compute Upper Confidence Bounds for the mode to choose next to decide on > + // how to deal with HTM capacity aborts. > + // Use fixed point arithmetic types to spare floating point usage. > + const fixed_pt_t log_sum = > + GTM::fixed_pt_t::FIXED_PT_TWO.mul( > + (fixed_pt_t(optimizer.txns_while_stubborn).add( > + fixed_pt_t(optimizer.txns_while_halven)).add( > + fixed_pt_t(optimizer.txns_while_giveup))).ln()); > + const fixed_pt_t ucb_stubborn = > + ((fixed_pt_t(optimizer.txns_while_stubborn).div( > + fixed_pt_t(optimizer.cycles_while_stubborn))).div( > + optimizer.best_ever_throughput)).add( > + (log_sum.div(fixed_pt_t(optimizer.txns_while_stubborn))).sqrt()); > + const fixed_pt_t ucb_halven = > + ((fixed_pt_t(optimizer.txns_while_halven).div( > + fixed_pt_t(optimizer.cycles_while_halven))).div( > + optimizer.best_ever_throughput)).add( > + (log_sum.div(fixed_pt_t(optimizer.txns_while_halven))).sqrt()); > + const fixed_pt_t ucb_giveup = > + ((fixed_pt_t(optimizer.txns_while_giveup).div( > + fixed_pt_t(optimizer.cycles_while_giveup)).div( > + optimizer.best_ever_throughput))).add( > + (log_sum.div(fixed_pt_t(optimizer.txns_while_giveup))).sqrt()); > + > + if (ucb_stubborn.number > ucb_halven.number && > + ucb_stubborn.number > ucb_giveup.number) > + { > + optimizer.optimized_mode = STUBBORN; > + } > + else if (ucb_halven.number > ucb_giveup.number) > + { > + optimizer.optimized_mode = HALVEN; > + } > + else > + { > + optimizer.optimized_mode = GIVEUP; > + } > + > + // Compute Gradient Descent to regulate the number of retries in HTM. > + const fixed_pt_t LARGE_DEGRADATION = fixed_pt_t(1.40); > + const fixed_pt_t MIN_IMPROVEMENT = fixed_pt_t(1.05); > + const int32_t MAX_RETRIES = 20; > + const int32_t MIN_RETRIES = 2; > + > + const fixed_pt_t change_for_better = > + current_throughput.div(optimizer.last_throughput); > + const fixed_pt_t change_for_worse = > + optimizer.last_throughput.div(current_throughput); > + int32_t last_attempts = optimizer.last_attempts; > + int32_t current_attempts = htm_fastpath; > + int32_t new_attempts = current_attempts; > + if (unlikely(change_for_worse.number > > fixed_pt_t(LARGE_DEGRADATION).number)) > + { > + htm_fastpath = optimizer.best_ever_attempts; > + optimizer.last_throughput = current_throughput; > + optimizer.last_attempts = current_attempts; > + goto finish_reoptimization; > + } > + > + if (unlikely(random_int(100) == 0)) > + { > + optimizer.last_attempts = htm_fastpath; > + optimizer.last_throughput = current_throughput; > + htm_fastpath = random_int(MAX_RETRIES - MIN_RETRIES) + MIN_RETRIES; > + goto finish_reoptimization; > + } > + > + if (change_for_better.number > MIN_IMPROVEMENT.number) > + { > + new_attempts += current_attempts - last_attempts; > + if (current_attempts == last_attempts) > + { > + new_attempts += random_int(2) == 0 ? -1 : 1; > + } > + } > + else if (change_for_worse.number > MIN_IMPROVEMENT.number) > + { > + new_attempts -= current_attempts - last_attempts; > + if (current_attempts == last_attempts) > + { > + new_attempts += random_int(2) == 0 ? -1 : 1; > + } > + } > + > + if (unlikely(new_attempts > MAX_RETRIES)) > + { > + new_attempts = MAX_RETRIES; > + } > + else if (unlikely(new_attempts < MIN_RETRIES)) > + { > + new_attempts = MIN_RETRIES; > + } > + > + if (current_attempts != new_attempts) > + { > + optimizer.last_attempts = current_attempts; > + optimizer.last_throughput = current_throughput; > + } > + > + htm_fastpath = new_attempts; > + if (unlikely(current_throughput.number > > optimizer.best_ever_throughput.number)) > + { > + optimizer.best_ever_throughput = current_throughput; > + optimizer.best_ever_attempts = current_attempts; > + } > + > + finish_reoptimization: > + GTM::gtm_thread::serial_lock.write_unlock(); > +} > + > uint32_t > GTM::gtm_thread::begin_transaction (uint32_t prop, const gtm_jmpbuf *jb) > { > @@ -209,8 +518,36 @@ GTM::gtm_thread::begin_transaction (uint32_t prop, > } > // The transaction has aborted. Don't retry if it's unlikely that > // retrying the transaction will be successful. > - if (!htm_abort_should_retry(ret)) > - break; > + if (optimizer.is_enabled) > + { > + // If the HTM self-tuning is enabled, then we act according to > + // its configuration. > + if (htm_is_capacity_abort(ret)) > + { > + gtm_capacity_abort_mode capacity_mode = > + optimizer.optimized_mode; > + // Adapt the retries according to the mode. Note that the > + // linear decrease is implicit in the outer for loop. > + if (capacity_mode == HALVEN) > + t = t / 2 + 1; > + else if (capacity_mode == GIVEUP) > + break; > + } > + > + // Take the chance of having aborted to possibly re-optimize the > + // the HTM configuration. Do this as a last resort before going > + // for the serial lock. > + if (unlikely(tx->number_executed_txns % > + optimizer.OPTIMIZATION_PERIOD_TXS == 0 && t == 1)) > + optimizer.reoptimize_htm_execution(); > + } > + else > + { > + // Execute the fixed logic when there is no adaptation. > + if (!htm_abort_should_retry(ret)) > + break; > + } > + > // Wait until any concurrent serial-mode transactions have finished. > // This is an empty critical section, but won't be elided. > if (serial_lock.is_write_locked()) > @@ -248,7 +585,11 @@ GTM::gtm_thread::begin_transaction (uint32_t prop, > // HTM fastpath aborted, and that we thus have to decide whether to retry > // the fastpath (returning a_tryHTMFastPath) or just proceed with the > // fallback method. > - if (likely(htm_fastpath && (prop & pr_HTMRetryableAbort))) > + // Even if pr_HTMRetryableAbort is not true, we still consider to retry if > + // it was a capacity abort (stated by pr_HTMCapacityAbort) and the self > + // tuning capabilities are enabled (stated by optimizer.is_enabled). > + if (likely(htm_fastpath && ((prop & pr_HTMRetryableAbort) > + || ((prop & pr_HTMCapacityAbort) && (optimizer.is_enabled))))) > { > tx = gtm_thr(); > if (unlikely(tx == NULL)) > @@ -266,6 +607,26 @@ GTM::gtm_thread::begin_transaction (uint32_t prop, > > if (--tx->restart_total > 0) > { > + // See above: we use the optimized configuration if enabled. > + if (optimizer.is_enabled) > + { > + // See above: similarly, we adapt upon capacity aborts. > + if (prop & pr_HTMCapacityAbort) > + { > + gtm_capacity_abort_mode capacity_mode = optimizer.optimized_mode; > + if (capacity_mode == HALVEN) > + tx->restart_total = tx->restart_total / 2 + 1; > + else if (capacity_mode == GIVEUP) > + goto stop_custom_htm_fastpath; > + } > + > + // See above: similarly, we periodically re-optimize. > + if (unlikely(tx->number_executed_txns % > + optimizer.OPTIMIZATION_PERIOD_TXS == 0 && > + tx->restart_total == 1)) > + optimizer.reoptimize_htm_execution(); > + } > + > // Wait until any concurrent serial-mode transactions have finished. > // Essentially the same code as above. > if (serial_lock.is_write_locked()) > @@ -665,12 +1026,23 @@ _ITM_commitTransaction(void) > if (likely(htm_fastpath && !gtm_thread::serial_lock.is_write_locked())) > { > htm_commit(); > + gtm_thread *tx = gtm_thr(); > + if (unlikely(tx == NULL)) > + { > + // See above: when using the HTM fast-path, we might not have created > + // a thread object yet. > + tx = new gtm_thread(); > + set_gtm_thr(tx); > + } > + tx->number_executed_txns++; > return; > } > #endif > gtm_thread *tx = gtm_thr(); > if (!tx->trycommit ()) > tx->restart (RESTART_VALIDATE_COMMIT); > + > + tx->number_executed_txns++; > } > > void ITM_REGPARM > @@ -681,6 +1053,14 @@ _ITM_commitTransactionEH(void *exc_ptr) > if (likely(htm_fastpath && !gtm_thread::serial_lock.is_write_locked())) > { > htm_commit(); > + gtm_thread *tx = gtm_thr(); > + if (unlikely(tx == NULL)) > + { > + // See above. > + tx = new gtm_thread(); > + set_gtm_thr(tx); > + } > + tx->number_executed_txns++; > return; > } > #endif > @@ -690,4 +1070,5 @@ _ITM_commitTransactionEH(void *exc_ptr) > tx->eh_in_flight = exc_ptr; > tx->restart (RESTART_VALIDATE_COMMIT); > } > + tx->number_executed_txns++; > } > Index: config/s390/target.h > =================================================================== > --- config/s390/target.h (revision 219316) > +++ config/s390/target.h (working copy) > @@ -116,12 +116,30 @@ htm_abort_should_retry (uint32_t begin_ret) > } > > static inline bool > +htm_is_capacity_abort (uint32_t begin_ret) > +{ > + return begin_ret != _HTM_TBEGIN_PERSISTENT; > +} > + > +static inline bool > htm_transaction_active () > { > __asm volatile (".machine \"all\" \n\t"); > return __builtin_tx_nesting_depth() != 0; > } > > +static inline uint64_t > +get_efficient_time() > +{ > + return 0; > +} > + > +static inline bool > +allows_reoptimization() > +{ > + return false; > +} > + > #endif > > } // namespace GTM > Index: config/arm/target.h > =================================================================== > --- config/arm/target.h (revision 219316) > +++ config/arm/target.h (working copy) > @@ -46,4 +46,10 @@ cpu_relax (void) > __sync_synchronize (); > } > > +static inline bool > +allows_reoptimization() > +{ > + return false; > +} > + > } // namespace GTM > Index: config/powerpc/target.h > =================================================================== > --- config/powerpc/target.h (revision 219316) > +++ config/powerpc/target.h (working copy) > @@ -74,6 +74,7 @@ cpu_relax (void) > #define _TBEGIN_STARTED 0 > #define _TBEGIN_INDETERMINATE 1 > #define _TBEGIN_PERSISTENT 2 > +#define _TBEGIN_CAPACITY 3 > > /* Number of retries for transient failures. */ > #define _HTM_RETRIES 10 > @@ -101,6 +102,9 @@ htm_begin (void) > if (_TEXASRU_FAILURE_PERSISTENT (__builtin_get_texasru ())) > return _TBEGIN_PERSISTENT; > > + if (_TEXASRU_FOOTPRINT_OVERFLOW (__builtin_get_texasru ())) > + return _TBEGIN_CAPACITY; > + > return _TBEGIN_INDETERMINATE; > } > > @@ -128,6 +132,12 @@ htm_abort_should_retry (uint32_t begin_ret) > return begin_ret != _TBEGIN_PERSISTENT; > } > > +static inline bool > +htm_is_capacity_abort (uint32_t begin_ret) > +{ > + return begin_ret == _TBEGIN_CAPACITY; > +} > + > /* Returns true iff a hardware transaction is currently being executed. */ > static inline bool > htm_transaction_active (void) > @@ -135,6 +145,36 @@ htm_transaction_active (void) > return (_HTM_STATE (__builtin_ttest ()) == _HTM_TRANSACTIONAL); > } > > +static inline uint64_t > +get_efficient_time() > +{ > + #if defined (__powerpc64__) || defined (__ppc64__) > + uint64_t __tb; > + __asm__ volatile ("mfspr %[tb], 268\n" > + : [tb]"=r" (__tb) > + : ); > + return __tb; > + #else > + register unsigned long __tbu, __tbl, __tmp; \ > + __asm__ volatile ( > + "0:\n" > + "mftbu %[tbu]\n" > + "mftbl %[tbl]\n" > + "mftbu %[tmp]\n" > + "cmpw %[tbu], %[tmp]\n" > + "bne- 0b\n" > + : [tbu]"=r" (__tbu), [tbl]"=r" (__tbl), [tmp]"=r" (__tmp) > + : ); > + return (( (uint64_t) __tbu << 32) | __tbl); > + #endif /* not __powerpc64__ */ > +} > + > +static inline bool > +allows_reoptimization() > +{ > + return true; > +} > + > #endif > > } // namespace GTM > Index: config/alpha/target.h > =================================================================== > --- config/alpha/target.h (revision 219316) > +++ config/alpha/target.h (working copy) > @@ -41,4 +41,10 @@ cpu_relax (void) > __asm volatile ("" : : : "memory"); > } > > +static inline bool > +allows_reoptimization() > +{ > + return false; > +} > + > } // namespace GTM > Index: config/x86/sjlj.S > =================================================================== > --- config/x86/sjlj.S (revision 219316) > +++ config/x86/sjlj.S (working copy) > @@ -59,12 +59,14 @@ > #define pr_hasNoAbort 0x08 > #define pr_HTMRetryableAbort 0x800000 > #define pr_HTMRetriedAfterAbort 0x1000000 > +#define pr_HTMCapacityAbort 0x2000000 > #define a_runInstrumentedCode 0x01 > #define a_runUninstrumentedCode 0x02 > #define a_tryHTMFastPath 0x20 > > #define _XABORT_EXPLICIT (1 << 0) > #define _XABORT_RETRY (1 << 1) > +#define _XABORT_CAPACITY (1 << 3) > > .text > > @@ -111,6 +113,9 @@ SYM(_ITM_beginTransaction): > testl $(_XABORT_RETRY|_XABORT_EXPLICIT), %eax > jz .Lno_htm > orl $pr_HTMRetryableAbort, %edi > + testl $(_XABORT_CAPACITY), %eax > + jz .Lno_htm > + orl $pr_HTMCapacityAbort, %edi > /* Let the C++ code handle the retry policy. */ > .Lno_htm: > #endif > Index: config/x86/target.h > =================================================================== > --- config/x86/target.h (revision 219316) > +++ config/x86/target.h (working copy) > @@ -126,13 +126,33 @@ htm_abort_should_retry (uint32_t begin_ret) > return begin_ret & _XABORT_RETRY; > } > > +static inline bool > +htm_is_capacity_abort(uint32_t begin_ret) > +{ > + return begin_ret & _XABORT_CAPACITY; > +} > + > /* Returns true iff a hardware transaction is currently being executed. */ > static inline bool > htm_transaction_active () > { > return _xtest() != 0; > } > + > +static inline uint64_t > +get_efficient_time() > +{ > + uint32_t hi, lo; > + __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi)); > + return ((unsigned long long) lo) | (((unsigned long long) hi) << 32); > +} > + > +static inline bool > +allows_reoptimization() > +{ > + return true; > +} > + > #endif > > - > } // namespace GTM > Index: config/aarch64/target.h > =================================================================== > --- config/aarch64/target.h (revision 219316) > +++ config/aarch64/target.h (working copy) > @@ -42,4 +42,10 @@ cpu_relax (void) > __asm volatile ("" : : : "memory"); > } > > +static inline bool > +allows_reoptimization() > +{ > + return false; > +} > + > } // namespace GTM > Index: config/sparc/target.h > =================================================================== > --- config/sparc/target.h (revision 219316) > +++ config/sparc/target.h (working copy) > @@ -39,4 +39,10 @@ cpu_relax (void) > __asm volatile ("rd %%ccr, %%g0" : : : "memory"); > } > > +static inline bool > +allows_reoptimization() > +{ > + return false; > +} > + > } // namespace GTM > Index: config/sh/target.h > =================================================================== > --- config/sh/target.h (revision 219316) > +++ config/sh/target.h (working copy) > @@ -44,4 +44,10 @@ cpu_relax (void) > __asm volatile ("" : : : "memory"); > } > > +static inline bool > +allows_reoptimization() > +{ > + return false; > +} > + > } // namespace GTM > Index: libitm_i.h > =================================================================== > --- libitm_i.h (revision 219316) > +++ libitm_i.h (working copy) > @@ -242,6 +242,9 @@ struct gtm_thread > uint32_t restart_reason[NUM_RESTARTS]; > uint32_t restart_total; > > + // Keeps track of how many transactions were successfully executed. > + uint64_t number_executed_txns; > + > // *** The shared part of gtm_thread starts here. *** > // Shared state is on separate cachelines to avoid false sharing with > // thread-local parts of gtm_thread. > @@ -309,6 +312,94 @@ struct gtm_thread > void commit_user_actions (); > }; > > +// Definition of fixed point arithmetic types. > +// Half the bits are dedicated to the fractional type, and the rest to the > +// "whole" part. > +struct fixed_pt_t > +{ > + uint64_t number; > + > + fixed_pt_t(double n) > + { > + this->number = static_cast<uint64_t>(n * FIXED_PT_ONE.number + (n >>= 0 ? 0.5 : -0.5)); > + } > + > + fixed_pt_t add(const fixed_pt_t operand) const; > + fixed_pt_t sub(const fixed_pt_t operand) const; > + fixed_pt_t mul(const fixed_pt_t operand) const; > + fixed_pt_t div(const fixed_pt_t operand) const; > + fixed_pt_t sqrt() const; > + fixed_pt_t ln() const; > + > + static fixed_pt_t one(); > + > + static const uint32_t FIXED_PT_WIDTH; > + static const uint32_t FIXED_PT_INTEGER_WIDTH; > + static const uint32_t FIXED_PT_FRAC_WIDTH; > + static const fixed_pt_t FIXED_PT_ONE; > + static const fixed_pt_t FIXED_PT_TWO; > +}; > + > +// Different ways to deal with capacity aborts in HTM execution. > +enum gtm_capacity_abort_mode > +{ > + STUBBORN, > + HALVEN, > + GIVEUP > +}; > + > +// Maintains the current values optimized for HTM execution and the > +// corresponding statistics gathered for the decision-making. > +struct gtm_global_optimizer > +{ > + // Whether to re-optimize the HTM execution. This is parametrizable by an > + // env variable. > + uint32_t is_enabled; > + > + // Mode chosen to currently deal with capacity aborts. > + gtm_capacity_abort_mode optimized_mode; > + // The number of attempts chosen to currently insist on HTM execution is > + // maintained in htm_fastpath. > + > + uint64_t last_cycles; > + uint64_t last_total_txs_executed; > + > + fixed_pt_t last_throughput; > + uint32_t last_attempts; > + > + fixed_pt_t best_ever_throughput; > + uint32_t best_ever_attempts; > + > + uint64_t txns_while_stubborn; > + uint64_t cycles_while_stubborn; > + uint64_t txns_while_halven; > + uint64_t cycles_while_halven; > + uint64_t txns_while_giveup; > + uint64_t cycles_while_giveup; > + > + gtm_global_optimizer() : > + last_throughput(fixed_pt_t(0.0001)), > + best_ever_throughput(fixed_pt_t(0.0001)) > + { > + } > + > + void > + set_is_enabled(bool arg) > + { > + is_enabled = allows_reoptimization() && arg; > + } > + > + void reoptimize_htm_execution(); > + > + // Period, in number of transactions, between which we execute the > + // re-optimization procedure. > + static const uint32_t OPTIMIZATION_PERIOD_TXS = 500; > + // Initialization value for the last, fictional, throughput measured. > + static const fixed_pt_t MIN_THROUGHPUT = fixed_pt_t(0.001); > + // This serves as an initialization for the cycles used so far. > + static const uint32_t INIT_CYCLES = 100; > +}; > + > } // namespace GTM > > #include "tls.h" > @@ -346,6 +437,9 @@ extern gtm_cacheline_mask gtm_mask_stack(gtm_cache > // the name, avoiding complex name mangling. > extern uint32_t htm_fastpath __asm__(UPFX "gtm_htm_fastpath"); > > +// Maintains the optimization for HTM execution. > +extern gtm_global_optimizer optimizer; > + > } // namespace GTM > > #endif // LIBITM_I_H > Index: common.h > =================================================================== > --- common.h (revision 219316) > +++ common.h (working copy) > @@ -59,6 +59,11 @@ extern void * xcalloc (size_t s, bool separate_cl > extern void * xrealloc (void *p, size_t s, bool separate_cl = false) > __attribute__((malloc, nothrow)); > > +// Linear congruential number generator. > +// > +// Used also in glibc for generators with small state. > +extern uint32_t random_int(uint32_t max); > + > } // namespace GTM