Hello everyone, just wanted to ping back to say that I have been overwhelmed with work and will be back on this as soon as possible, most likely during July.
Best regards,
-- Nuno Diegues

On Tue, May 19, 2015 at 3:17 PM, Torvald Riegel <trie...@redhat.com> wrote:
> On Mon, 2015-05-18 at 23:27 -0400, Nuno Diegues wrote:
>> On Mon, May 18, 2015 at 5:29 PM, Torvald Riegel <trie...@redhat.com> wrote:
>> >
>> > Are there better options for the utility function, or can we tune it to
>> > be less affected by varying txn length and likelihood of txnal vs.
>> > nontxnal code?  What are the things we want to avoid with the tuning?  I
>> > can think of:
>> > * Not needing to wait for serial txns, or being aborted by a serial txn.
>> > * Not retrying using HTM too much so that the retry time is larger than
>> > the scalability we actually gain by having other txns commit
>> > concurrently.
>>
>> Yes, those are the key situations we want to make sure do not happen.
>>
>> > Anything else?  Did you consider other utility functions during your
>> > research?
>>
>> The txnal vs. nontxnal distinction is indeed a completely different story.
>> To account for it we would need extra book-keeping to count only cycles
>> spent inside txnal code.  That would require every thread (or a sample of
>> threads) to perform a rdtsc (or equivalent) on every begin/end call,
>> rather than the current approach of a single rdtsc per optimization round.
>>
>> With this type of online optimization we found that the algorithm had to
>> be very simple and cheap to execute.  RDTSC turned out to be a good fit
>> for this, and it enabled us to obtain gains; other time sources failed to
>> do so.
>>
>> I do not have, out of the box, a good alternative to offer.  I suppose it
>> would take some iterations of thinking and experimenting, just like with
>> any research problem :)
>
> So let's iterate on this in parallel with the other changes that we need
> to get in place.  I'd prefer to have some more confidence that measuring
> txn throughput in epochs is the best way forward.
>
> Here are a few thoughts:
>
> Why do we actually need to measure succeeded transactions?  If a HW txn
> is just getting committed without interfering with anything else, is
> this really different from, say, two HW txns getting committed?  Aren't
> we really just interested in wasted work, or delayed txns?  That may
> help taking care of the nontxnal vs. txnal problem.
>
> Measuring committed txns during a time that might otherwise be spent by
> a serial txn could be useful to figure out how much other useful work a
> serial txn prevents.  But we'd see that as well if we just went serial
> during the auto-tuning, because then concurrent txns would have to wait;
> and in this case we could measure it in the slow path of those
> concurrent txns (ie, during waiting, where measurement overhead wouldn't
> matter as much).
>
> If a serial txn is running, concurrent txns (that wait) are free to sync
> and tell the serial one how much cost it creates for the concurrent txns.
> There, txn length could matter, but we won't find out for real until
> after the concurrent ones have run (they could be pretty short, so we
> can't simply assume that the concurrent ones are as long as the serial
> one, and thus simply use the number of concurrent ones to calculate the
> delayed work).
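(For concreteness, the epoch-based measurement Nuno describes -- a single
rdtsc per optimization round, with throughput taken as the transactions
committed during the epoch divided by the cycles elapsed -- could be
sketched roughly as below.  The names are illustrative, and double is used
instead of the patch's fixed-point arithmetic; this is not the patch's
actual code.)

    // Illustrative sketch of one tuning round (not the patch's code).
    #include <stdint.h>
    #include <x86intrin.h>   // __rdtsc(); x86-only, like the patch's rdtsc()

    struct tuning_state
    {
      uint64_t last_cycles;        // TSC value at the previous round
      uint64_t last_total_txs;     // committed txns seen at the previous round
      double   last_throughput;    // throughput of the previous epoch
    };

    static void
    reoptimize_round (tuning_state &s, uint64_t total_txs_executed)
    {
      uint64_t now = __rdtsc ();   // the single rdtsc of this round
      uint64_t cycles = now - s.last_cycles;
      uint64_t txs = total_txs_executed - s.last_total_txs;

      // Throughput of the epoch that just ended: committed txns per cycle.
      double throughput = cycles ? (double) txs / (double) cycles : 0.0;

      // ... compare against s.last_throughput (and a best-ever value) to
      // decide whether to keep or change the number of HTM attempts ...

      s.last_cycles = now;
      s.last_total_txs = total_txs_executed;
      s.last_throughput = throughput;
    }

The per-begin/end rdtsc alternative discussed above would move the
timestamp reads into the begin/commit paths themselves, which is exactly
the per-transaction overhead this per-round scheme avoids.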
>> > Also, note that the mitigation strategy for rdtsc
>> > short-comings that you mention in the paper is not applicable in
>> > general, specifically not in libitm.
>>
>> I suppose you mean the preemption of threads inflating the cycles measured?
>
> Yes, preemption and migration of threads (in case there's no globally
> sync'ed TSC or similar) -- you mentioned in the paper that you pin
> threads to cores...
>
>> This would similarly be a problem for any time source that tries to
>> measure the amount of work performed; I am not sure how we can avoid it
>> in general.  Any thoughts?
>
> Not really, as long as we keep depending on measuring time in a
> light-weight way.  Measuring smaller time intervals could make it less
> likely that preemption happens during such an interval, though.
>
>> > Another issue is that we need to implement the tuning in a portable way.
>> > You currently make it depend on whether the assembler knows about RTM,
>> > whereas the rest of the code makes this more portable and defers to
>> > arch-specific files.  I'd prefer if we could use the tuning on other
>> > archs too.  But for that, we need to cleanly separate generic from
>> > arch-specific parts.  That affects magic numbers as well as things like
>> > rdtsc().
>>
>> Yes, I refrained from adding new calls to the arch-specific files, mainly
>> to contain the changes.  But that is possible, and it is part of the
>> feedback I was hoping to get.
>
> OK.  Let me know if you want further input regarding this.
>
>> > I'm wondering whether it really makes sense to treat XABORT like
>> > conflicts and other abort reasons, instead of like capacity aborts.
>> > Perhaps we need to differentiate between the explicit abort codes glibc
>> > uses, so that we can distinguish between cases where an abort is
>> > supposed to signal incompatibility with txnal execution and cases where
>> > it's just used for lock elision (see sysdeps/unix/sysv/linux/x86/hle.h
>> > in current glibc):
>> > #define _ABORT_LOCK_BUSY        0xff
>> > #define _ABORT_LOCK_IS_LOCKED   0xfe
>> > #define _ABORT_NESTED_TRYLOCK   0xfd
>>
>> I am not sure I follow: are you suggesting we consider other aborts
>> besides capacity ones?
>
> We may want to do this with UCB, similar to capacity.  Another option
> would be to, for now, make fixed choices regarding whether they are
> considered permanent or not.
> _ABORT_NESTED_TRYLOCK is an incompatibility with txnal execution
> (similar to illegal instructions and such).
> _ABORT_LOCK_BUSY is failing lock elision due to the lock already being
> acquired; thus, this is transient I'd say.
>
> Andi, what was _ABORT_LOCK_IS_LOCKED used for again?
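(If the fixed-choice option were adopted, the mapping from glibc's explicit
abort codes to a permanent/transient decision could look roughly like the
sketch below.  The codes are the ones quoted above; the helper itself and
its name are illustrative, not part of the patch.)

    // Illustrative sketch of the "fixed choices" option (not the patch's code).
    #include <stdint.h>

    // Explicit abort codes from glibc's sysdeps/unix/sysv/linux/x86/hle.h:
    #define _ABORT_LOCK_BUSY       0xff
    #define _ABORT_LOCK_IS_LOCKED  0xfe
    #define _ABORT_NESTED_TRYLOCK  0xfd

    // RTM status handling as defined in GCC's rtmintrin.h:
    #define _XABORT_EXPLICIT       (1 << 0)
    #define _XABORT_CODE(status)   (((status) >> 24) & 0xff)

    static inline bool
    explicit_abort_is_permanent (uint32_t begin_ret)
    {
      if (!(begin_ret & _XABORT_EXPLICIT))
        return false;
      switch (_XABORT_CODE (begin_ret))
        {
        case _ABORT_NESTED_TRYLOCK:
          // Incompatible with txnal execution (like illegal instructions):
          // treat as permanent.
          return true;
        case _ABORT_LOCK_BUSY:
          // Lock elision failed because the lock was already acquired:
          // transient, retrying may succeed.
          return false;
        case _ABORT_LOCK_IS_LOCKED:
          // Purpose still to be clarified (see the question above); treated
          // as transient here for lack of a better default.
          return false;
        default:
          return false;
        }
    }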
>> > Finally, please pay attention to using the same leading tabs/spaces
>> > style as the current libitm code; it could be just your MUA, but it
>> > seems that your patch uses just leading spaces.  libitm uses 8-char
>> > tabs for leading space.
>>
>> Now that I know the style, I will enforce it ;)
>> Previously I had trusted my editor to follow the current style, but maybe
>> it was not consistent everywhere, or it just got confused.
>
> Thanks :)
>
>> >
>> > > +  optimizer.optimized_mode = STUBBORN;
>> > > +  optimizer.optimized_attempts = htm_fastpath;
>> > > +  optimizer.last_cycles = rdtsc();
>> > > +  optimizer.last_total_txs_executed = 0;
>> > > +  optimizer.last_throughput = fixed_const(0.0001);
>> > > +  optimizer.last_attempts = htm_fastpath > 0 ? htm_fastpath - 1 : 1;
>> > > +  optimizer.best_ever_throughput = optimizer.last_throughput;
>> > > +  optimizer.best_ever_attempts = htm_fastpath;
>> > > +  optimizer.txns_while_stubborn = 1;
>> > > +  optimizer.cycles_while_stubborn = 100;
>> >
>> > If the assumption is that transactions can't be faster than 100 cycles,
>> > please document it briefly.
>>
>> No, it does not imply that.  This is an "accumulator" of cycles spent in
>> the mode; it only needs to be positive, as far as I remember.
>
> OK.  Please find out what the requirement is and document it (magic
> numbers...).
>
>> > > --- libitm/beginend.cc  (revision 219316)
>> > > +++ libitm/beginend.cc  (working copy)
>> > > @@ -25,7 +25,6 @@
>> > >  #include "libitm_i.h"
>> > >  #include <pthread.h>
>> > >
>> > > -
>> >
>> > Minor: just drop this chunk.
>>
>> What is this referring to?
>
> This part of the patch, where all you do is delete a line.
>
>> > > @@ -190,7 +411,7 @@ GTM::gtm_thread::begin_transaction (uint32_t prop,
>> > >  #ifndef HTM_CUSTOM_FASTPATH
>> > >    if (likely(htm_fastpath && (prop & pr_hasNoAbort)))
>> > >      {
>> > > -      for (uint32_t t = htm_fastpath; t; t--)
>> > > +      for (uint32_t t = optimizer.optimized_attempts; t; t--)
>> >
>> > Why this change?  Couldn't you keep htm_fastpath as the globally set
>> > number of attempts?
>>
>> The idea was to keep the tuned variables within the optimizer struct.
>> Hence, optimized_attempts takes over the role of htm_fastpath whenever
>> the tuning is enabled.  We can keep htm_fastpath in sync with the
>> optimizer, though, by additionally assigning it the final value of
>> optimized_attempts during each re-optimization.
>
> I'd prefer the latter.  If we end up having other configurations that
> use HTM but don't use the optimizer, it would be awkward to have the
> same logical setting in two locations depending on the configuration
> (ie, htm_fastpath or optimized_attempts).
>
> Also, even though we don't do this yet, it may be easier to co-locate
> the separate htm_fastpath value with something else (e.g., the serial
> lock), so that we can save a cacheline of capacity in case of nested
> txns.  htm_fastpath is one of the few things that a HW txn must access.
>
>> > > @@ -665,12 +917,21 @@ _ITM_commitTransaction(void)
>> > >    if (likely(htm_fastpath &&
>> > >               !gtm_thread::serial_lock.is_write_locked()))
>> > >      {
>> > >        htm_commit();
>> > > +      gtm_thread *tx = gtm_thr();
>> > > +      if (unlikely(tx == NULL))
>> > > +        {
>> > > +          tx = new gtm_thread();
>> > > +          set_gtm_thr(tx);
>> > > +        }
>> >
>> > Are there actually situations in which we need to create a gtm_thread?
>>
>> I did not check at run-time; statically, I followed the approach of
>> checking, which is quite omnipresent in the code.
>
> Yeah, it's necessary.  Can you add a comment along the lines of:
> When using the HTM fastpath, we might not have created a thread object
> yet.
>
>> > > +      tx->number_executed_txns++;
>> >
>> > It might be nice if we can avoid this, so that we don't touch the
>> > additional cacheline when we have nested txns.
>>
>> This is really important to have some way to measure success/progress
>> to guide the tuning.
>
> I'm aware of that -- but is there a way to count differently (see the
> utility function point above), or to just count failures, so that the
> actual counting happens on a slow path (eg, after an abort)?
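(A rough sketch of that failure-counting idea -- updating statistics only
after an abort, so the commit fast path never touches the extra cacheline --
might look like this, using GCC's RTM intrinsics.  It is illustrative only
and not the patch's begin_transaction code.)

    // Illustrative sketch (not the patch): count HTM failures only on the
    // abort slow path, so the commit fast path never touches the counter.
    #include <stdint.h>
    #include <immintrin.h>  // _xbegin()/_XBEGIN_STARTED/_XABORT_RETRY, needs -mrtm

    struct tx_stats { uint64_t htm_aborts; };
    static __thread tx_stats stats;

    // Returns true if an HTM transaction was started; on success the caller
    // runs the txn body and ends it with _xend().
    static bool
    try_htm_begin (unsigned max_attempts)
    {
      for (unsigned t = max_attempts; t; t--)
        {
          unsigned status = _xbegin ();
          if (status == _XBEGIN_STARTED)
            return true;               // fast path: no counter update here
          // Slow path: we already paid for an abort, so touching the
          // statistics cacheline is comparatively cheap.
          stats.htm_aborts++;
          if (!(status & _XABORT_RETRY))
            break;                     // likely permanent; stop retrying
        }
      return false;                    // caller falls back to the serial path
    }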
>> > > --- libitm/config/x86/target.h  (revision 219316)
>> > > +++ libitm/config/x86/target.h  (working copy)
>> > > @@ -126,12 +126,25 @@ htm_abort_should_retry (uint32_t begin_ret)
>> > >    return begin_ret & _XABORT_RETRY;
>> > >  }
>> > >
>> > > +static inline bool
>> > > +htm_is_capacity_abort(uint32_t begin_ret)
>> > > +{
>> > > +  return begin_ret & _XABORT_CAPACITY;
>> > > +}
>> >
>> > Is this indeed just about capacity, or do we actually mean a more
>> > general situation?
>>
>> The idea of the Upper Confidence Bounds (UCB) part of the algorithm was
>> to distinguish "real" (i.e., permanent) capacity aborts from "transient"
>> ones.  So in this case we really wanted to know whether the abort was a
>> capacity one, so that we could direct the UCB algorithm and understand
>> the "type" of capacity abort.
>
> OK.  But do you want to single out capacity aborts as one case for which
> you want to detect permanent vs. transient, or do you want to generally
> find out whether aborts that are reported as permanent (e.g., capacity)
> are indeed permanent?  In the latter case, a more generic name would be
> more appropriate.  (And that's not just a naming thing, but guidance for
> other archs regarding what they should put in the "capacity" bucket of
> abort reasons...)
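(For reference, the UCB idea described above boils down to a bandit-style
choice between reacting to a capacity abort as transient -- retry in HTM --
or as permanent -- give up and fall back to the serial path.  A textbook
UCB1 selection rule is sketched below; the arms, reward definition, and
constants are illustrative and not taken from the patch or the paper.)

    // Generic UCB1 sketch (not the patch's or the paper's exact algorithm):
    // choose between treating a capacity abort as transient (retry in HTM)
    // or permanent (give up), based on the rewards observed so far.
    #include <cmath>
    #include <cstdint>

    struct ucb_arm
    {
      double   reward_sum = 0.0;   // e.g., 1.0 whenever the choice paid off
      uint64_t pulls = 0;          // how often this arm has been chosen
    };

    enum { ARM_RETRY_HTM = 0, ARM_GIVE_UP = 1, NUM_ARMS = 2 };

    static int
    ucb_choose (const ucb_arm arms[NUM_ARMS], uint64_t total_pulls)
    {
      // Play each arm once before trusting the scores.
      for (int i = 0; i < NUM_ARMS; i++)
        if (arms[i].pulls == 0)
          return i;

      int best = 0;
      double best_score = -1.0;
      for (int i = 0; i < NUM_ARMS; i++)
        {
          double mean = arms[i].reward_sum / arms[i].pulls;          // exploit
          double bonus = std::sqrt (2.0 * std::log ((double) total_pulls)
                                    / arms[i].pulls);                // explore
          if (mean + bonus > best_score)
            {
              best_score = mean + bonus;
              best = i;
            }
        }
      return best;
    }

    // After the outcome of the chosen reaction is known:
    //   arms[choice].reward_sum += reward;  arms[choice].pulls++;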