Re: [PATCH] Make integer output faster in libgfortran
Hi fX, right now I don’t have a Linux system with 32-bit support. I’ll see how I can connect to gcc45, but if someone who is already set up to do can fire a quick regtest, that would be great;) I tested this on x86_64-pc-linux-gnu with make -k -j8 check-fortran RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'" and didn't see any problems. So, OK for trunk. (We could also do something like that for a 32-bit system, but that is another kettle of fish). Thanks for taking this up! Best regards Thomas
Re: [PATCH][Hashtable 6/6] PR 68303 small size optimization
On 23/12/21 2:03 pm, Jonathan Wakely wrote: On 21/12/21 07:07 +0100, François Dumont wrote: Hi ??? Is there a chance for this patch to be integrated for next gcc release ? Yes, I think this can still make it for GCC 12 (the patch was first posted long ago in stage 1 and it's just me being slow to review it). But I have some questions and comments ... diff --git a/libstdc++-v3/include/bits/hashtable.h b/libstdc++-v3/include/bits/hashtable.h index 6e2d4c10cfe..6b2c6b99fef 100644 --- a/libstdc++-v3/include/bits/hashtable.h +++ b/libstdc++-v3/include/bits/hashtable.h @@ -419,6 +419,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _M_uses_single_bucket() const { return _M_uses_single_bucket(_M_buckets); } + static constexpr size_t + __small_size_threshold() + { + return + __detail::_Hashtable_hash_traits<_Hash>::__small_size_threshold(); + } + I've added the noexcept qualification as you asked. __hashtable_alloc& _M_base_alloc() { return *this; } @@ -788,6 +795,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _M_bucket_index(__hash_code __c) const { return __hash_code_base::_M_bucket_index(__c, _M_bucket_count); } + __node_base_ptr + _M_find_before_node(const key_type&); + // Find and insert helper functions and types // Find the node before the one matching the criteria. __node_base_ptr @@ -831,6 +841,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION __node_base_ptr _M_get_previous_node(size_type __bkt, __node_ptr __n); + pair + _M_compute_hash_code(const_iterator __hint, const key_type& __k) const; + // Insert node __n with hash code __code, in bucket __bkt if no // rehash (assumes no element with same key already present). // Takes ownership of __n if insertion succeeds, throws otherwise. @@ -1126,7 +1139,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION void _M_rehash(size_type __bkt_count, const __rehash_state& __state); }; - // Definitions of class template _Hashtable's out-of-line member functions. template iterator { + if (size() <= __small_size_threshold()) + { + for (auto __it = begin(); __it != end(); ++__it) + if (this->_M_key_equals(__k, *__it._M_cur)) + return __it; + return end(); + } This loop is repeated a few times, would it be better factored out into its own function? _M_find_loop or something? The return type is different in some cases, so maybe it's OK like this. Yes, I also thought about that but there is often small changes from one loop to another as you noticed. + __hash_code __code = this->_M_hash_code(__k); std::size_t __bkt = _M_bucket_index(__code); return iterator(_M_find_node(__bkt, __k, __code)); @@ -1643,6 +1663,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION find(const key_type& __k) const -> const_iterator { + if (size() <= __small_size_threshold()) + { + for (auto __it = begin(); __it != end(); ++__it) + if (this->_M_key_equals(__k, *__it._M_cur)) + return __it; + return end(); + } + __hash_code __code = this->_M_hash_code(__k); std::size_t __bkt = _M_bucket_index(__code); return const_iterator(_M_find_node(__bkt, __k, __code)); @@ -1855,6 +1883,35 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } #endif + // Find the node before the one whose key compares equal to k. + // Return nullptr if no node is found. + template + auto + _Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, + _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>:: + _M_find_before_node(const key_type& __k) + -> __node_base_ptr + { + __node_base_ptr __prev_p = &_M_before_begin; This is OK now, but would need to be std::__addressof(_M_before_begin) if/when the _Hash_code_base type becomes dependent on the allocator's pointer. Yes, maybe after gcc release we will talk about those fancy pointer types again. __n_last = __n_last->_M_next(); diff --git a/libstdc++-v3/include/bits/hashtable_policy.h b/libstdc++-v3/include/bits/hashtable_policy.h index 0b5443fc187..b4a8af081ce 100644 --- a/libstdc++-v3/include/bits/hashtable_policy.h +++ b/libstdc++-v3/include/bits/hashtable_policy.h @@ -246,6 +246,20 @@ namespace __detail using __unique_keys = __bool_constant<_Unique_keys>; }; + /** + * struct _Hashtable_hash_traits + * + * Important traits for hash tables depending on associated hasher. + * + */ + template + struct _Hashtable_hash_traits + { + static constexpr std::size_t + __small_size_threshold() + { return std::__is_fast_hash<_Hash>::value ? 0 : 20; } + }; Yet another trait that nobody is ever going to specialize make me sad. I don't have a better idea though. Sure, but maybe I can document it ? I also wonder why you did not propose to make it a constant rather than requesting to add the noexcept. { @@ -1263,6 +1279,14 @@ namespace __detail const
Re: [PATCH] Make integer output faster in libgfortran
Hi Thomas, > There are two possibilities: Either use gcc45 on the compile farm, or > run it with > make -k -j8 check-fortran RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'" Thanks, right now I don’t have a Linux system with 32-bit support. I’ll see how I can connect to gcc45, but if someone who is already set up to do can fire a quick regtest, that would be great ;) FX
Re: [PATCH] Make integer output faster in libgfortran
Hi FX, The patch has been bootstrapped and regtested on two 64-bit targets: aarch64-apple-darwin21 (development branch) and x86_64-pc-gnu-linux. I would like it to be tested on a 32-bit target without 128-bit integer type. Does someone have access to that? There are two possibilities: Either use gcc45 on the compile farm, or run it with make -k -j8 check-fortran RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'" which is the magic incantation to also use -m32 binaries. You'll need the 32-bit support on your Linux system, of course (which you can check quickly with a "hello world" kind of program with -m32). Regards Thomas
[PATCH v3] i386: Check AX input in any_mul_highpart peepholes
When applying peephole optimization to transform mov imm, %reg0 mov %reg1, %AX_REG imul %reg0 to mov imm, %AX_REG imul %reg1 disable peephole optimization if reg1 == AX_REG. gcc/ PR bootstrap/103785 * config/i386/i386.md: Swap operand order in comments and check AX input in any_mul_highpart peepholes. gcc/testsuite/ PR bootstrap/103785 * gcc.target/i386/pr103785.c: New test. --- gcc/config/i386/i386.md | 9 -- gcc/testsuite/gcc.target/i386/pr103785.c | 38 2 files changed, 44 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr103785.c diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 284b9507466..1c1ec227f71 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -8578,7 +8578,7 @@ (set_attr "mode" "SI")]) ;; Highpart multiplication peephole2s to tweak register allocation. -;; mov %rdx,imm; mov %rax,%rdi; imulq %rdx -> mov %rax,imm; imulq %rdi +;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx -> mov imm,%rax; imulq %rdi (define_peephole2 [(set (match_operand:SWI48 0 "general_reg_operand") (match_operand:SWI48 1 "immediate_operand")) @@ -8588,7 +8588,8 @@ (any_mul_highpart:SWI48 (match_dup 2) (match_dup 0))) (clobber (match_dup 2)) (clobber (reg:CC FLAGS_REG))])] - "REGNO (operands[0]) != REGNO (operands[2]) + "REGNO (operands[3]) != AX_REG + && REGNO (operands[0]) != REGNO (operands[2]) && REGNO (operands[0]) != REGNO (operands[3]) && (REGNO (operands[0]) == REGNO (operands[4]) || peep2_reg_dead_p (3, operands[0]))" @@ -8608,7 +8609,9 @@ (any_mul_highpart:SI (match_dup 2) (match_dup 0 (clobber (match_dup 2)) (clobber (reg:CC FLAGS_REG))])] - "REGNO (operands[0]) != REGNO (operands[2]) + "REGNO (operands[3]) != AX_REG + && REGNO (operands[0]) != REGNO (operands[2]) + && REGNO (operands[2]) != REGNO (operands[3]) && REGNO (operands[0]) != REGNO (operands[3]) && (REGNO (operands[0]) == REGNO (operands[4]) || peep2_reg_dead_p (3, operands[0]))" diff --git a/gcc/testsuite/gcc.target/i386/pr103785.c b/gcc/testsuite/gcc.target/i386/pr103785.c new file mode 100644 index 000..5503b965256 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr103785.c @@ -0,0 +1,38 @@ +/* { dg-do run } */ +/* { dg-options "-O2" } */ + +#include + +struct wrapper_t +{ + long k; + long e; +}; + +struct wrapper_t **table; + +__attribute__ ((weak, regparm (2))) +void +update (long k, long e) +{ + struct wrapper_t *elmt; + + elmt = table[k % 3079]; + if (elmt == 0) +return; + elmt->e = e; +} + +int +main () +{ + table = (struct wrapper_t **) malloc (20 * sizeof (struct wrapper_t *)); + for (int i = 0; i < 20; i++) +table[i] = (struct wrapper_t *) calloc (sizeof (struct wrapper_t), 1); + if (table[10]->e != 0) +abort (); + update (10, 20); + if (table[10]->e != 20) +abort (); + return 0; +} -- 2.33.1
Re: [PATCH] Simplify integer output-related functions in libgfortran
First merry Christmas to all! Bootstrapped and regtested on x86_64-pc-linux-gnu. OK to commit? OK. Thanks for the (preliminary) patch!
[PATCH] Make integer output faster in libgfortran
Hi, Integer output in libgfortran is done by passing values as the largest integer type available. This is what our gfc_itoa() function for conversion to decimal form uses, as well, performing series of divisions by 10. On targets with a 128-bit integer type (which is most targets, really, nowadays), division is slow, because it is implemented in software and requires a call to a libgcc function. We can speed this up in two easy ways: - If the value fits into 64-bit, use a simple 64-bit itoa() function, which does the series of divisions by 10 with hardware. Most I/O will actually fall into that case, in real-life, unless you’re printing very big 128-bit integers. - If the value does not fit into 64-bit, perform only one slow division, by 10^19, and use two calls to the 64-bit function to output each part (the low part needing zero-padding). What is the speed-up? It really depends on the exact nature of the I/O done. For the most common-case, list-directed I/O with no special format, the patch does not speed (or slow!) things for values up to HUGE(KIND=4), but speeds things up for larger values. For very large 128-bit values, it can cut the I/O time in half. I attach my own timing code to this email. Results before the patch (with previous itoa-patch applied, though): Timing for INTEGER(KIND=1) Value 0, time: 0.191409990 Value HUGE(KIND=1), time: 0.173687011 Timing for INTEGER(KIND=4) Value 0, time: 0.171809018 Value 1049, time: 0.177439988 Value HUGE(KIND=4), time: 0.217984974 Timing for INTEGER(KIND=8) Value 0, time: 0.178072989 Value HUGE(KIND=4), time: 0.214841008 Value HUGE(KIND=8), time: 0.276726007 Timing for INTEGER(KIND=16) Value 0, time: 0.175235987 Value HUGE(KIND=4), time: 0.217689037 Value HUGE(KIND=8), time: 0.280257106 Value HUGE(KIND=16), time: 0.420036077 Results after the patch: Timing for INTEGER(KIND=1) Value 0, time: 0.194633007 Value HUGE(KIND=1), time: 0.172436997 Timing for INTEGER(KIND=4) Value 0, time: 0.167517006 Value 1049, time: 0.176503003 Value HUGE(KIND=4), time: 0.172892988 Timing for INTEGER(KIND=8) Value 0, time: 0.171101034 Value HUGE(KIND=4), time: 0.174461007 Value HUGE(KIND=8), time: 0.180289030 Timing for INTEGER(KIND=16) Value 0, time: 0.175765991 Value HUGE(KIND=4), time: 0.181162953 Value HUGE(KIND=8), time: 0.186082959 Value HUGE(KIND=16), time: 0.207401991 Times are CPU times in seconds, for one million integer writes into a buffer string. With the patch, we see that integer decimal output is almost independent of the value written, meaning the I/O library overhead is dominant, not the decimal conversion. For this reason, I don’t think we really need a faster implementation of the 64-bit itoa, and can keep the current series-of-division-by-10 approach. --- This patch applies on top of my previous itoa-related patch at https://gcc.gnu.org/pipermail/fortran/2021-December/057218.html The patch has been bootstrapped and regtested on two 64-bit targets: aarch64-apple-darwin21 (development branch) and x86_64-pc-gnu-linux. I would like it to be tested on a 32-bit target without 128-bit integer type. Does someone have access to that? Once tested on a 32-bit target, OK to commit? FX itoa-faster.patch Description: Binary data timing.f90 Description: Binary data
[PATCH] Simplify integer output-related functions in libgfortran
Merry Christmas! The code related to integer output in libgfortran has accumulated some… oddities over the years. I will soon post a finalized patch for faster integer-to-decimal conversion (see https://gcc.gnu.org/pipermail/fortran/2021-December/057201.html), but while working on that I found a couple of things we ought to fix, that are not directly related. So this patch is a simplification patch, a no-op. It does the following things: - gfc_itoa() is always called for nonnegative values, to make it take an unsigned arg. It allows us to simplify the code handling for negative signs (instead of doing it in two places). - fix undefined behaviour on possible overflow when negating large negative values (-HUGE-1) - all callers of write_decimal() always use gfc_itoa() as conversion function, so remove one layer of indirection: get rid of that argument, call gfc_itoa() directly inside write_decimal() - gfc_xtoa() is only used in one file anymore, so move it there and rename it to xtoa() - ztoa_big() is renamed to xtoa_big(), following the convention of other ?to_big() functions - runtime/backtrace.c is the only user of gfc_itoa() outside the I/O system; add a comment so we remember in the future why we use gfc_itoa() there… and what are its limits All this makes the code easier to understand, more consistent, probably marginally more efficient (the gfc_itoa pointer indirection), and will make the future work on speeding up gfc_itoa() easier. Bootstrapped and regtested on x86_64-pc-linux-gnu. OK to commit? FX itoa.patch Description: Binary data