Hello everyone, just wanted to ping back to say that I have been overwhelmed with work and will be back on this as soon as possible, most likely during July.
Best regards,
-- Nuno Diegues

On Tue, May 19, 2015 at 3:17 PM, Torvald Riegel <trie...@redhat.com> wrote:
> On Mon, 2015-05-18 at 23:27 -0400, Nuno Diegues wrote:
>> On Mon, May 18, 2015 at 5:29 PM, Torvald Riegel <trie...@redhat.com> wrote:
>> >
>> > Are there better options for the utility function, or can we tune it to
>> > be less affected by varying txn length and likelihood of txnal vs.
>> > nontxnal code?  What are the things we want to avoid with the tuning?  I
>> > can think of:
>> > * Not needing to wait for serial txns, or being aborted by a serial txn.
>> > * Not retrying using HTM too much so that the retry time is larger than
>> > the scalability we actually gain by having other txns commit
>> > concurrently.
>>
>> Yes, those are the key situations we want to make sure do not happen.
>>
>> > Anything else?  Did you consider other utility functions during your
>> > research?
>>
>> The txnal vs. nontxnal distinction is indeed a completely different story.
>> To account for it we would need extra book-keeping to count only cycles
>> spent inside txnal code.  That would require every thread (or a sample of
>> threads) to perform a rdtsc (or equivalent) on every begin/end call,
>> rather than the current approach of a single rdtsc per optimization round.
>>
>> With this type of online optimization we found that the algorithm had to
>> be very simple and cheap to execute.  RDTSC turned out to be a good fit
>> for this, and it enabled us to obtain gains; other time sources failed to
>> do so.
>>
>> I do not have, out of the box, a good alternative to offer.  I suppose it
>> would take some iterations of thinking and experimenting, just like with
>> any research problem :)
>
> So let's iterate on this in parallel with the other changes that we need
> to get in place.  I'd prefer to have some more confidence that measuring
> txn throughput in epochs is the best way forward.
>
> Here are a few thoughts:
>
> Why do we actually need to measure succeeded transactions?  If a HW txn
> is just getting committed without interfering with anything else, is
> this really different from, say, two HW txns getting committed?  Aren't
> we really just interested in wasted work, or delayed txns?  That may
> help taking care of the nontxnal vs. txnal problem.
>
> Measuring committed txns during a time that might otherwise be spent by
> a serial txn could be useful to figure out how much other useful work a
> serial txn prevents.  But we'd see that as well if we just went serial
> during the auto-tuning, because then concurrent txns would have to wait;
> and in this case we could measure it in the slow path of those
> concurrent txns (ie, during waiting, where measurement overhead wouldn't
> matter as much).
>
> If a serial txn is running, concurrent txns (that wait) are free to sync
> and tell the serial one how much cost it creates for the concurrent txns.
> There, txn length could matter, but we won't find out for real until
> after the concurrent ones have run (they could be pretty short, so we
> can't simply assume that the concurrent ones are as long as the serial
> one, and thus simply use the number of concurrent ones to calculate the
> delayed work).
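(For concreteness, the epoch-based measurement Nuno describes -- a single
rdtsc per optimization round, with throughput taken as the transactions
committed during the epoch divided by the cycles elapsed -- could be
sketched roughly as below.  The names are illustrative, and double is used
instead of the patch's fixed-point arithmetic; this is not the patch's
actual code.)

    // Illustrative sketch of one tuning round (not the patch's code).
    #include <stdint.h>
    #include <x86intrin.h>   // __rdtsc(); x86-only, like the patch's rdtsc()

    struct tuning_state
    {
      uint64_t last_cycles;        // TSC value at the previous round
      uint64_t last_total_txs;     // committed txns seen at the previous round
      double   last_throughput;    // throughput of the previous epoch
    };

    static void
    reoptimize_round (tuning_state &s, uint64_t total_txs_executed)
    {
      uint64_t now = __rdtsc ();   // the single rdtsc of this round
      uint64_t cycles = now - s.last_cycles;
      uint64_t txs = total_txs_executed - s.last_total_txs;

      // Throughput of the epoch that just ended: committed txns per cycle.
      double throughput = cycles ? (double) txs / (double) cycles : 0.0;

      // ... compare against s.last_throughput (and a best-ever value) to
      // decide whether to keep or change the number of HTM attempts ...

      s.last_cycles = now;
      s.last_total_txs = total_txs_executed;
      s.last_throughput = throughput;
    }

The per-begin/end rdtsc alternative discussed above would move the
timestamp reads into the begin/commit paths themselves, which is exactly
the per-transaction overhead this per-round scheme avoids.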
>> > Also, note that the mitigation strategy for rdtsc
>> > short-comings that you mention in the paper is not applicable in
>> > general, specifically not in libitm.
>>
>> I suppose you mean the preemption of threads inflating the cycles measured?
>
> Yes, preemption and migration of threads (in case there's no globally
> sync'ed TSC or similar) -- you mentioned in the paper that you pin
> threads to cores...
>
>> This would similarly be a problem for any time source that tries to
>> measure the amount of work performed; I am not sure how we can avoid it
>> in general.  Any thoughts?
>
> Not really, as long as we keep depending on measuring time in a
> light-weight way.  Measuring smaller time intervals could make it less
> likely that preemption happens during such an interval, though.
>
>> > Another issue is that we need to implement the tuning in a portable way.
>> > You currently make it depend on whether the assembler knows about RTM,
>> > whereas the rest of the code makes this more portable and defers to
>> > arch-specific files.  I'd prefer if we could use the tuning on other
>> > archs too.  But for that, we need to cleanly separate generic from
>> > arch-specific parts.  That affects magic numbers as well as things like
>> > rdtsc().
>>
>> Yes, I refrained from adding new calls to the arch-specific files, mainly
>> to contain the changes.  But that is possible, and it is part of the
>> feedback I was hoping to get.
>
> OK.  Let me know if you want further input regarding this.
>
>> > I'm wondering whether it really makes sense to treat XABORT like
>> > conflicts and other abort reasons, instead of like capacity aborts.
>> > Perhaps we need to differentiate between the explicit abort codes glibc
>> > uses, so that we can distinguish between cases where an abort is
>> > supposed to signal incompatibility with txnal execution and cases where
>> > it's just used for lock elision (see sysdeps/unix/sysv/linux/x86/hle.h
>> > in current glibc):
>> > #define _ABORT_LOCK_BUSY        0xff
>> > #define _ABORT_LOCK_IS_LOCKED   0xfe
>> > #define _ABORT_NESTED_TRYLOCK   0xfd
>>
>> I am not sure I follow: are you suggesting we consider other aborts
>> besides capacity ones?
>
> We may want to do this with UCB, similar to capacity.  Another option
> would be to, for now, make fixed choices regarding whether they are
> considered permanent or not.
> _ABORT_NESTED_TRYLOCK is an incompatibility with txnal execution
> (similar to illegal instructions and such).
> _ABORT_LOCK_BUSY is failing lock elision due to the lock already being
> acquired; thus, this is transient I'd say.
>
> Andi, what was _ABORT_LOCK_IS_LOCKED used for again?
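(If the fixed-choice option were adopted, the mapping from glibc's explicit
abort codes to a permanent/transient decision could look roughly like the
sketch below.  The codes are the ones quoted above; the helper itself and
its name are illustrative, not part of the patch.)

    // Illustrative sketch of the "fixed choices" option (not the patch's code).
    #include <stdint.h>

    // Explicit abort codes from glibc's sysdeps/unix/sysv/linux/x86/hle.h:
    #define _ABORT_LOCK_BUSY       0xff
    #define _ABORT_LOCK_IS_LOCKED  0xfe
    #define _ABORT_NESTED_TRYLOCK  0xfd

    // RTM status handling as defined in GCC's rtmintrin.h:
    #define _XABORT_EXPLICIT       (1 << 0)
    #define _XABORT_CODE(status)   (((status) >> 24) & 0xff)

    static inline bool
    explicit_abort_is_permanent (uint32_t begin_ret)
    {
      if (!(begin_ret & _XABORT_EXPLICIT))
        return false;
      switch (_XABORT_CODE (begin_ret))
        {
        case _ABORT_NESTED_TRYLOCK:
          // Incompatible with txnal execution (like illegal instructions):
          // treat as permanent.
          return true;
        case _ABORT_LOCK_BUSY:
          // Lock elision failed because the lock was already acquired:
          // transient, retrying may succeed.
          return false;
        case _ABORT_LOCK_IS_LOCKED:
          // Purpose still to be clarified (see the question above); treated
          // as transient here for lack of a better default.
          return false;
        default:
          return false;
        }
    }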
>> > Finally, please pay attention to using the same leading tabs/spaces
>> > style as the current libitm code; it could be just your MUA, but it
>> > seems that your patch uses just leading spaces.  libitm uses 8-char
>> > tabs for leading space.
>>
>> Now that I know the style, I will enforce it ;)
>> Previously I had trusted my editor to follow the current style, but maybe
>> it was not consistent everywhere, or it just got confused.
>
> Thanks :)
>
>> >
>> > > +  optimizer.optimized_mode = STUBBORN;
>> > > +  optimizer.optimized_attempts = htm_fastpath;
>> > > +  optimizer.last_cycles = rdtsc();
>> > > +  optimizer.last_total_txs_executed = 0;
>> > > +  optimizer.last_throughput = fixed_const(0.0001);
>> > > +  optimizer.last_attempts = htm_fastpath > 0 ? htm_fastpath - 1 : 1;
>> > > +  optimizer.best_ever_throughput = optimizer.last_throughput;
>> > > +  optimizer.best_ever_attempts = htm_fastpath;
>> > > +  optimizer.txns_while_stubborn = 1;
>> > > +  optimizer.cycles_while_stubborn = 100;
>> >
>> > If the assumption is that transactions can't be faster than 100 cycles,
>> > please document it briefly.
>>
>> No, it does not imply that.  This is an "accumulator" of cycles spent in
>> the mode; it only needs to be positive, as far as I remember.
>
> OK.  Please find out what the requirement is and document it (magic
> numbers...).
>
>> > > --- libitm/beginend.cc  (revision 219316)
>> > > +++ libitm/beginend.cc  (working copy)
>> > > @@ -25,7 +25,6 @@
>> > >  #include "libitm_i.h"
>> > >  #include <pthread.h>
>> > >
>> > > -
>> >
>> > Minor: just drop this chunk.
>>
>> What is this referring to?
>
> This part of the patch, where all you do is delete a line.
>
>> > > @@ -190,7 +411,7 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
>> > >  #ifndef HTM_CUSTOM_FASTPATH
>> > >    if (likely(htm_fastpath && (prop & pr_hasNoAbort)))
>> > >      {
>> > > -      for (uint32_t t = htm_fastpath; t; t--)
>> > > +      for (uint32_t t = optimizer.optimized_attempts; t; t--)
>> >
>> > Why this change?  Couldn't you keep htm_fastpath as the globally set
>> > number of attempts?
>>
>> The idea was to keep the tuned variables within the optimizer struct.
>> Hence, optimized_attempts takes over the role of htm_fastpath whenever
>> the tuning is enabled.  We can keep htm_fastpath in sync with the
>> optimizer, though, by additionally assigning it the final value of
>> optimized_attempts during each re-optimization.
>
> I'd prefer the latter.  If we end up having other configurations that
> use HTM but don't use the optimizer, it would be awkward to have the
> same logical setting in two locations depending on the configuration
> (ie, htm_fastpath or optimized_attempts).
>
> Also, even though we don't do this yet, it may be easier to co-locate
> the separate htm_fastpath value with something else (e.g., the serial
> lock), so that we can save a cacheline of capacity in case of nested
> txns.  htm_fastpath is one of the few things that a HW txn must access.
>
>> > > @@ -665,12 +917,21 @@ _ITM_commitTransaction(void)
>> > >    if (likely(htm_fastpath &&
>> > >               !gtm_thread::serial_lock.is_write_locked()))
>> > >      {
>> > >        htm_commit();
>> > > +      gtm_thread *tx = gtm_thr();
>> > > +      if (unlikely(tx == NULL))
>> > > +        {
>> > > +          tx = new gtm_thread();
>> > > +          set_gtm_thr(tx);
>> > > +        }
>> >
>> > Are there actually situations in which we need to create a gtm_thread?
>>
>> I did not check at run-time; statically, I followed the approach of
>> checking, which is quite omnipresent in the code.
>
> Yeah, it's necessary.  Can you add a comment along the lines of:
> When using the HTM fastpath, we might not have created a thread object
> yet.
>
>> > > +      tx->number_executed_txns++;
>> >
>> > It might be nice if we can avoid this, so that we don't touch the
>> > additional cacheline when we have nested txns.
>>
>> This is really important to have some way to measure success/progress
>> to guide the tuning.
>
> I'm aware of that -- but is there a way to count differently (see the
> utility function point above), or to just count failures, so that the
> actual counting happens on a slow path (eg, after an abort)?
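(A rough sketch of that failure-counting idea -- updating statistics only
after an abort, so the commit fast path never touches the extra cacheline --
might look like this, using GCC's RTM intrinsics.  It is illustrative only
and not the patch's begin_transaction code.)

    // Illustrative sketch (not the patch): count HTM failures only on the
    // abort slow path, so the commit fast path never touches the counter.
    #include <stdint.h>
    #include <immintrin.h>  // _xbegin()/_XBEGIN_STARTED/_XABORT_RETRY, needs -mrtm

    struct tx_stats { uint64_t htm_aborts; };
    static __thread tx_stats stats;

    // Returns true if an HTM transaction was started; on success the caller
    // runs the txn body and ends it with _xend().
    static bool
    try_htm_begin (unsigned max_attempts)
    {
      for (unsigned t = max_attempts; t; t--)
        {
          unsigned status = _xbegin ();
          if (status == _XBEGIN_STARTED)
            return true;               // fast path: no counter update here
          // Slow path: we already paid for an abort, so touching the
          // statistics cacheline is comparatively cheap.
          stats.htm_aborts++;
          if (!(status & _XABORT_RETRY))
            break;                     // likely permanent; stop retrying
        }
      return false;                    // caller falls back to the serial path
    }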
>> > > --- libitm/config/x86/target.h  (revision 219316)
>> > > +++ libitm/config/x86/target.h  (working copy)
>> > > @@ -126,12 +126,25 @@ htm_abort_should_retry (uint32_t begin_ret)
>> > >    return begin_ret & _XABORT_RETRY;
>> > >  }
>> > >
>> > > +static inline bool
>> > > +htm_is_capacity_abort(uint32_t begin_ret)
>> > > +{
>> > > +  return begin_ret & _XABORT_CAPACITY;
>> > > +}
>> >
>> > Is this indeed just about capacity, or do we actually mean a more
>> > general situation?
>>
>> The idea of the Upper Confidence Bounds (UCB) part of the algorithm was
>> to distinguish "real" (i.e., permanent) capacity aborts from "transient"
>> ones.  So in this case we really wanted to know whether the abort was a
>> capacity one, so that we could direct the UCB algorithm and understand
>> the "type" of capacity abort.
>
> OK.  But do you want to single out capacity aborts as one case for which
> you want to detect permanent vs. transient, or do you want to generally
> find out whether aborts that are reported as permanent (e.g., capacity)
> are indeed permanent?  In the latter case, a more generic name would be
> more appropriate.  (And that's not just a naming thing, but guidance for
> other archs regarding what they should put in the "capacity" bucket of
> abort reasons...)
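(For reference, the UCB idea described above boils down to a bandit-style
choice between reacting to a capacity abort as transient -- retry in HTM --
or as permanent -- give up and fall back to the serial path.  A textbook
UCB1 selection rule is sketched below; the arms, reward definition, and
constants are illustrative and not taken from the patch or the paper.)

    // Generic UCB1 sketch (not the patch's or the paper's exact algorithm):
    // choose between treating a capacity abort as transient (retry in HTM)
    // or permanent (give up), based on the rewards observed so far.
    #include <cmath>
    #include <cstdint>

    struct ucb_arm
    {
      double   reward_sum = 0.0;   // e.g., 1.0 whenever the choice paid off
      uint64_t pulls = 0;          // how often this arm has been chosen
    };

    enum { ARM_RETRY_HTM = 0, ARM_GIVE_UP = 1, NUM_ARMS = 2 };

    static int
    ucb_choose (const ucb_arm arms[NUM_ARMS], uint64_t total_pulls)
    {
      // Play each arm once before trusting the scores.
      for (int i = 0; i < NUM_ARMS; i++)
        if (arms[i].pulls == 0)
          return i;

      int best = 0;
      double best_score = -1.0;
      for (int i = 0; i < NUM_ARMS; i++)
        {
          double mean = arms[i].reward_sum / arms[i].pulls;          // exploit
          double bonus = std::sqrt (2.0 * std::log ((double) total_pulls)
                                    / arms[i].pulls);                // explore
          if (mean + bonus > best_score)
            {
              best_score = mean + bonus;
              best = i;
            }
        }
      return best;
    }

    // After the outcome of the chosen reaction is known:
    //   arms[choice].reward_sum += reward;  arms[choice].pulls++;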