Re: [PATCH] Make integer output faster in libgfortran

2021-12-25 Thread Thomas Koenig via Gcc-patches

Hi fX,


right now I don’t have a Linux system with 32-bit support. I’ll see how I can 
connect to gcc45, but if someone who is already set up to do can fire a quick 
regtest, that would be great;)


I tested this on x86_64-pc-linux-gnu with

make -k -j8 check-fortran RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'"

and didn't see any problems.

So, OK for trunk.

(We could also do something like that for a 32-bit system, but
that is another kettle of fish).

Thanks for taking this up!

Best regards

Thomas



Re: [PATCH][Hashtable 6/6] PR 68303 small size optimization

2021-12-25 Thread François Dumont via Gcc-patches

On 23/12/21 2:03 pm, Jonathan Wakely wrote:

On 21/12/21 07:07 +0100, François Dumont wrote:

Hi

??? Is there a chance for this patch to be integrated for next gcc
release ?


Yes, I think this can still make it for GCC 12 (the patch was first
posted long ago in stage 1 and it's just me being slow to review it).

But I have some questions and comments ...


diff --git a/libstdc++-v3/include/bits/hashtable.h 
b/libstdc++-v3/include/bits/hashtable.h

index 6e2d4c10cfe..6b2c6b99fef 100644
--- a/libstdc++-v3/include/bits/hashtable.h
+++ b/libstdc++-v3/include/bits/hashtable.h
@@ -419,6 +419,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  _M_uses_single_bucket() const
  { return _M_uses_single_bucket(_M_buckets); }

+  static constexpr size_t
+  __small_size_threshold()
+  {
+    return
+ __detail::_Hashtable_hash_traits<_Hash>::__small_size_threshold();
+  }
+

I've added the noexcept qualification as you asked.

  __hashtable_alloc&
  _M_base_alloc() { return *this; }

@@ -788,6 +795,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  _M_bucket_index(__hash_code __c) const
  { return __hash_code_base::_M_bucket_index(__c, 
_M_bucket_count); }


+  __node_base_ptr
+  _M_find_before_node(const key_type&);
+
  // Find and insert helper functions and types
  // Find the node before the one matching the criteria.
  __node_base_ptr
@@ -831,6 +841,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  __node_base_ptr
  _M_get_previous_node(size_type __bkt, __node_ptr __n);

+  pair
+  _M_compute_hash_code(const_iterator __hint, const key_type& 
__k) const;

+
  // Insert node __n with hash code __code, in bucket __bkt if no
  // rehash (assumes no element with same key already present).
  // Takes ownership of __n if insertion succeeds, throws otherwise.
@@ -1126,7 +1139,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  void _M_rehash(size_type __bkt_count, const __rehash_state& 
__state);

    };

-
  // Definitions of class template _Hashtable's out-of-line member 
functions.

  template iterator
    {
+  if (size() <= __small_size_threshold())
+    {
+  for (auto __it = begin(); __it != end(); ++__it)
+    if (this->_M_key_equals(__k, *__it._M_cur))
+  return __it;
+  return end();
+    }


This loop is repeated a few times, would it be better factored out
into its own function? _M_find_loop or something? The return type is
different in some cases, so maybe it's OK like this.
Yes, I also thought about that but there is often small changes from one 
loop to another as you noticed.





+
  __hash_code __code = this->_M_hash_code(__k);
  std::size_t __bkt = _M_bucket_index(__code);
  return iterator(_M_find_node(__bkt, __k, __code));
@@ -1643,6 +1663,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
    find(const key_type& __k) const
    -> const_iterator
    {
+  if (size() <= __small_size_threshold())
+    {
+  for (auto __it = begin(); __it != end(); ++__it)
+    if (this->_M_key_equals(__k, *__it._M_cur))
+  return __it;
+  return end();
+    }
+
  __hash_code __code = this->_M_hash_code(__k);
  std::size_t __bkt = _M_bucket_index(__code);
  return const_iterator(_M_find_node(__bkt, __k, __code));
@@ -1855,6 +1883,35 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
  }
#endif

+  // Find the node before the one whose key compares equal to k.
+  // Return nullptr if no node is found.
+  template
+    auto
+    _Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal,
+   _Hash, _RangeHash, _Unused, _RehashPolicy, _Traits>::
+    _M_find_before_node(const key_type& __k)
+    -> __node_base_ptr
+    {
+  __node_base_ptr __prev_p = &_M_before_begin;


This is OK now, but would need to be std::__addressof(_M_before_begin)
if/when the _Hash_code_base type becomes dependent on the allocator's
pointer.
Yes, maybe after gcc release we will talk about those fancy pointer 
types again.


 __n_last = __n_last->_M_next();
diff --git a/libstdc++-v3/include/bits/hashtable_policy.h 
b/libstdc++-v3/include/bits/hashtable_policy.h

index 0b5443fc187..b4a8af081ce 100644
--- a/libstdc++-v3/include/bits/hashtable_policy.h
+++ b/libstdc++-v3/include/bits/hashtable_policy.h
@@ -246,6 +246,20 @@ namespace __detail
  using __unique_keys = __bool_constant<_Unique_keys>;
    };

+  /**
+   *  struct _Hashtable_hash_traits
+   *
+   *  Important traits for hash tables depending on associated hasher.
+   *
+   */
+  template
+    struct _Hashtable_hash_traits
+    {
+  static constexpr std::size_t
+  __small_size_threshold()
+  { return std::__is_fast_hash<_Hash>::value ? 0 : 20; }
+    };


Yet another trait that nobody is ever going to specialize make me sad.
I don't have a better idea though.


Sure, but maybe I can document it ?

I also wonder why you did not propose to make it a constant rather than 
requesting to add the noexcept.




   {

@@ -1263,6 +1279,14 @@ namespace __detail
    const 

Re: [PATCH] Make integer output faster in libgfortran

2021-12-25 Thread FX via Gcc-patches
Hi Thomas,

> There are two possibilities: Either use gcc45 on the compile farm, or
> run it with
> make -k -j8 check-fortran RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'"

Thanks, right now I don’t have a Linux system with 32-bit support. I’ll see how 
I can connect to gcc45, but if someone who is already set up to do can fire a 
quick regtest, that would be great ;)

FX

Re: [PATCH] Make integer output faster in libgfortran

2021-12-25 Thread Thomas Koenig via Gcc-patches

Hi FX,


The patch has been bootstrapped and regtested on two 64-bit targets: 
aarch64-apple-darwin21 (development branch) and x86_64-pc-gnu-linux. I would 
like it to be tested on a 32-bit target without 128-bit integer type. Does 
someone have access to that?


There are two possibilities: Either use gcc45 on the compile farm, or
run it with

make -k -j8 check-fortran RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'"

which is the magic incantation to also use -m32 binaries.  You'll need
the 32-bit support on your Linux system, of course (which you can
check quickly with a "hello world" kind of program with -m32).

Regards

Thomas


[PATCH v3] i386: Check AX input in any_mul_highpart peepholes

2021-12-25 Thread H.J. Lu via Gcc-patches
When applying peephole optimization to transform

mov imm, %reg0
mov %reg1, %AX_REG
imul %reg0

to

mov imm, %AX_REG
imul %reg1

disable peephole optimization if reg1 == AX_REG.

gcc/

PR bootstrap/103785
* config/i386/i386.md: Swap operand order in comments and check
AX input in any_mul_highpart peepholes.

gcc/testsuite/

PR bootstrap/103785
* gcc.target/i386/pr103785.c: New test.
---
 gcc/config/i386/i386.md  |  9 --
 gcc/testsuite/gcc.target/i386/pr103785.c | 38 
 2 files changed, 44 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103785.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 284b9507466..1c1ec227f71 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -8578,7 +8578,7 @@
(set_attr "mode" "SI")])
 
 ;; Highpart multiplication peephole2s to tweak register allocation.
-;; mov %rdx,imm; mov %rax,%rdi; imulq %rdx  ->  mov %rax,imm; imulq %rdi
+;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx  ->  mov imm,%rax; imulq %rdi
 (define_peephole2
   [(set (match_operand:SWI48 0 "general_reg_operand")
(match_operand:SWI48 1 "immediate_operand"))
@@ -8588,7 +8588,8 @@
   (any_mul_highpart:SWI48 (match_dup 2) (match_dup 0)))
  (clobber (match_dup 2))
  (clobber (reg:CC FLAGS_REG))])]
-  "REGNO (operands[0]) != REGNO (operands[2])
+  "REGNO (operands[3]) != AX_REG
+   && REGNO (operands[0]) != REGNO (operands[2])
&& REGNO (operands[0]) != REGNO (operands[3])
&& (REGNO (operands[0]) == REGNO (operands[4])
|| peep2_reg_dead_p (3, operands[0]))"
@@ -8608,7 +8609,9 @@
 (any_mul_highpart:SI (match_dup 2) (match_dup 0
  (clobber (match_dup 2))
  (clobber (reg:CC FLAGS_REG))])]
-  "REGNO (operands[0]) != REGNO (operands[2])
+  "REGNO (operands[3]) != AX_REG
+   && REGNO (operands[0]) != REGNO (operands[2])
+   && REGNO (operands[2]) != REGNO (operands[3])
&& REGNO (operands[0]) != REGNO (operands[3])
&& (REGNO (operands[0]) == REGNO (operands[4])
|| peep2_reg_dead_p (3, operands[0]))"
diff --git a/gcc/testsuite/gcc.target/i386/pr103785.c 
b/gcc/testsuite/gcc.target/i386/pr103785.c
new file mode 100644
index 000..5503b965256
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr103785.c
@@ -0,0 +1,38 @@
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#include 
+
+struct wrapper_t
+{
+  long k;
+  long e;
+};
+
+struct wrapper_t **table;
+
+__attribute__ ((weak, regparm (2)))
+void
+update (long k, long e)
+{
+  struct wrapper_t *elmt;
+
+  elmt = table[k % 3079];
+  if (elmt == 0)
+return;
+  elmt->e = e;
+}
+
+int
+main ()
+{
+  table = (struct wrapper_t **) malloc (20 * sizeof (struct wrapper_t *));
+  for (int i = 0; i < 20; i++)
+table[i] = (struct wrapper_t *) calloc (sizeof (struct wrapper_t), 1);
+  if (table[10]->e != 0)
+abort ();
+  update (10, 20);
+  if (table[10]->e != 20)
+abort ();
+  return 0;
+}
-- 
2.33.1



Re: [PATCH] Simplify integer output-related functions in libgfortran

2021-12-25 Thread Thomas Koenig via Gcc-patches

First merry Christmas to all!


Bootstrapped and regtested on x86_64-pc-linux-gnu.
OK to commit?


OK.

Thanks for the (preliminary) patch!


[PATCH] Make integer output faster in libgfortran

2021-12-25 Thread FX via Gcc-patches
Hi,

Integer output in libgfortran is done by passing values as the largest integer 
type available. This is what our gfc_itoa() function for conversion to decimal 
form uses, as well, performing series of divisions by 10. On targets with a 
128-bit integer type (which is most targets, really, nowadays), division is 
slow, because it is implemented in software and requires a call to a libgcc 
function.

We can speed this up in two easy ways:
- If the value fits into 64-bit, use a simple 64-bit itoa() function, which 
does the series of divisions by 10 with hardware. Most I/O will actually fall 
into that case, in real-life, unless you’re printing very big 128-bit integers.
- If the value does not fit into 64-bit, perform only one slow division, by 
10^19, and use two calls to the 64-bit function to output each part (the low 
part needing zero-padding).


What is the speed-up? It really depends on the exact nature of the I/O done. 
For the most common-case, list-directed I/O with no special format, the patch 
does not speed (or slow!) things for values up to HUGE(KIND=4), but speeds 
things up for larger values. For very large 128-bit values, it can cut the I/O 
time in half.

I attach my own timing code to this email. Results before the patch (with 
previous itoa-patch applied, though):

 Timing for INTEGER(KIND=1)
 Value 0, time:  0.191409990
 Value HUGE(KIND=1), time:  0.173687011
 Timing for INTEGER(KIND=4)
 Value 0, time:  0.171809018
 Value 1049, time:  0.177439988
 Value HUGE(KIND=4), time:  0.217984974
 Timing for INTEGER(KIND=8)
 Value 0, time:  0.178072989
 Value HUGE(KIND=4), time:  0.214841008
 Value HUGE(KIND=8), time:  0.276726007
 Timing for INTEGER(KIND=16)
 Value 0, time:  0.175235987
 Value HUGE(KIND=4), time:  0.217689037
 Value HUGE(KIND=8), time:  0.280257106
 Value HUGE(KIND=16), time:  0.420036077

Results after the patch:

 Timing for INTEGER(KIND=1)
 Value 0, time:  0.194633007
 Value HUGE(KIND=1), time:  0.172436997
 Timing for INTEGER(KIND=4)
 Value 0, time:  0.167517006
 Value 1049, time:  0.176503003
 Value HUGE(KIND=4), time:  0.172892988
 Timing for INTEGER(KIND=8)
 Value 0, time:  0.171101034
 Value HUGE(KIND=4), time:  0.174461007
 Value HUGE(KIND=8), time:  0.180289030
 Timing for INTEGER(KIND=16)
 Value 0, time:  0.175765991
 Value HUGE(KIND=4), time:  0.181162953
 Value HUGE(KIND=8), time:  0.186082959
 Value HUGE(KIND=16), time:  0.207401991

Times are CPU times in seconds, for one million integer writes into a buffer 
string. With the patch, we see that integer decimal output is almost 
independent of the value written, meaning the I/O library overhead is dominant, 
not the decimal conversion. For this reason, I don’t think we really need a 
faster implementation of the 64-bit itoa, and can keep the current 
series-of-division-by-10 approach.

---

This patch applies on top of my previous itoa-related patch at 
https://gcc.gnu.org/pipermail/fortran/2021-December/057218.html

The patch has been bootstrapped and regtested on two 64-bit targets: 
aarch64-apple-darwin21 (development branch) and x86_64-pc-gnu-linux. I would 
like it to be tested on a 32-bit target without 128-bit integer type. Does 
someone have access to that?

Once tested on a 32-bit target, OK to commit?

FX



itoa-faster.patch
Description: Binary data


timing.f90
Description: Binary data


[PATCH] Simplify integer output-related functions in libgfortran

2021-12-25 Thread FX via Gcc-patches
Merry Christmas!

The code related to integer output in libgfortran has accumulated some… 
oddities over the years. I will soon post a finalized patch for faster 
integer-to-decimal conversion (see 
https://gcc.gnu.org/pipermail/fortran/2021-December/057201.html), but while 
working on that I found a couple of things we ought to fix, that are not 
directly related.

So this patch is a simplification patch, a no-op. It does the following things:

 - gfc_itoa() is always called for nonnegative values, to make it take an 
unsigned arg. It allows us to simplify the code handling for negative signs 
(instead of doing it in two places).
 - fix undefined behaviour on possible overflow when negating large negative 
values (-HUGE-1)
 - all callers of write_decimal() always use gfc_itoa() as conversion function, 
so remove one layer of indirection: get rid of that argument, call gfc_itoa() 
directly inside write_decimal()
 - gfc_xtoa() is only used in one file anymore, so move it there and rename it 
to xtoa()
 - ztoa_big() is renamed to xtoa_big(), following the convention of other 
?to_big() functions
 - runtime/backtrace.c is the only user of gfc_itoa() outside the I/O system; 
add a comment so we remember in the future why we use gfc_itoa() there… and 
what are its limits

All this makes the code easier to understand, more consistent, probably 
marginally more efficient (the gfc_itoa pointer indirection), and will make the 
future work on speeding up gfc_itoa() easier.

Bootstrapped and regtested on x86_64-pc-linux-gnu.
OK to commit?

FX



itoa.patch
Description: Binary data