andishgar commented on issue #48123:
URL: https://github.com/apache/arrow/issues/48123#issuecomment-3561813079
@pitrou I conducted further research, and here are the results:
1- The existing implementation has a bug around numbers that are powers of 2.
The test below passes against it, even though `b` is only two representable
doubles above `a`:
```c++
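#include <cmath>
#include <limits>

#include <gtest/gtest.h>
// WithinUlp is the Arrow helper under discussion (its include is omitted here).

TEST(ULP, AroundPower2) {
  constexpr double kInf = std::numeric_limits<double>::infinity();
  const double a_raw = 4;
  // a and b are the doubles immediately below and above 4.
  const double a = std::nextafter(a_raw, -kInf);
  const double b = std::nextafter(a_raw, kInf);
  // Stepping up twice from a lands exactly on b.
  double c = std::nextafter(a, kInf);
  c = std::nextafter(c, kInf);
  ASSERT_EQ(b, c);
  // Passes today: a and b are reported as more than 2 ULPs apart.
  ASSERT_FALSE(WithinUlp(a, b, 2));
}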
```
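For reference, one common way to get a ULP distance that stays consistent across a power-of-2 boundary such as 4.0 is to compare the integer bit patterns of the two doubles. The sketch below only illustrates that idea; it is not the Arrow implementation and not necessarily the new version benchmarked in this comment. `OrderedBits`, `UlpDistance` and `WithinUlpSketch` are hypothetical names, and the code assumes IEEE 754 binary64 doubles, treats NaN as never within range, and counts infinity as one step beyond the largest finite value.

```c++
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper (not part of Arrow): maps a double to an integer so that
// adjacent representable doubles map to adjacent integers. -0.0 and +0.0 both
// map to 0.
inline std::int64_t OrderedBits(double x) {
  std::int64_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  // For negative values the sign bit is set; mirror the magnitude so that the
  // integer order matches the floating-point order.
  return bits < 0 ? std::numeric_limits<std::int64_t>::min() - bits : bits;
}

// Number of representable doubles between a and b (0 if they are equal).
inline std::uint64_t UlpDistance(double a, double b) {
  const std::int64_t ia = OrderedBits(a);
  const std::int64_t ib = OrderedBits(b);
  // Subtract in unsigned arithmetic to avoid signed overflow.
  return ia >= ib ? static_cast<std::uint64_t>(ia) - static_cast<std::uint64_t>(ib)
                  : static_cast<std::uint64_t>(ib) - static_cast<std::uint64_t>(ia);
}

// Assumption: NaN is never within any ULP distance of anything.
inline bool WithinUlpSketch(double a, double b, std::uint64_t n_ulps) {
  if (std::isnan(a) || std::isnan(b)) return false;
  return UlpDistance(a, b) <= n_ulps;
}
```

With this definition the two neighbours of 4.0 from the test above are exactly 2 apart, so `WithinUlpSketch(a, b, 2)` is true and `WithinUlpSketch(a, b, 1)` is false.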
2- I wrote a benchmark that mimics float comparisons over an array. The
results are below.
My CPU is:
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70GHz
CPU family: 6
Model: 141
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 45%
CPU max MHz: 4500.0000
CPU min MHz: 800.0000
```
The results with Clang 20.1.2:
```
2025-11-21T09:56:15+03:30
Running ../my_version/Benchmark_ULP
Run on (12 X 2315.73 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 1280 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.88, 0.98, 0.98
------------------------------------------------------------------------------------
Benchmark                        Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------
my_float_version/1000            1996 ns       1996 ns     350625 bytes_per_second=3.7319Gi/s items_per_second=500.887M/s
my_float_version/10000          24942 ns      24941 ns      27731 bytes_per_second=2.98727Gi/s items_per_second=400.944M/s
my_float_version/100000        411094 ns     411095 ns       1719 bytes_per_second=1.81237Gi/s items_per_second=243.253M/s
my_float_version/1000000      4284970 ns    4284597 ns        163 bytes_per_second=1.73892Gi/s items_per_second=233.394M/s
my_float_version/10000000    44405282 ns   44384203 ns         16 bytes_per_second=1.67866Gi/s items_per_second=225.305M/s
my_double_verion/1000            2039 ns       2039 ns     342934 bytes_per_second=7.30777Gi/s items_per_second=490.416M/s
my_double_verion/10000          26622 ns      26621 ns      25899 bytes_per_second=5.59754Gi/s items_per_second=375.644M/s
my_double_verion/100000        444275 ns     444225 ns       1542 bytes_per_second=3.35442Gi/s items_per_second=225.111M/s
my_double_verion/1000000      4607614 ns    4606397 ns        150 bytes_per_second=3.23488Gi/s items_per_second=217.089M/s
my_double_verion/10000000    47061217 ns   47061698 ns         15 bytes_per_second=3.1663Gi/s items_per_second=212.487M/s
RUNNING: ../my_version/Benchmark_ULP --benchmark_filter=arrow.* --benchmark_out=/tmp/tmp0jihl778
2025-11-21T09:56:28+03:30
Running ../my_version/Benchmark_ULP
Run on (12 X 3903.04 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 1280 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.98, 1.00, 0.99
----------------------------------------------------------------------------------------
Benchmark                            Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------
arrow_float_version/1000             8195 ns       8194 ns      85350 bytes_per_second=931.117Mi/s items_per_second=122.043M/s
arrow_float_version/10000          127369 ns     127346 ns       5439 bytes_per_second=599.106Mi/s items_per_second=78.526M/s
arrow_float_version/100000        1349496 ns    1349150 ns        517 bytes_per_second=565.496Mi/s items_per_second=74.1207M/s
arrow_float_version/1000000      13905260 ns   13901427 ns         51 bytes_per_second=548.821Mi/s items_per_second=71.9351M/s
arrow_float_version/10000000    137240593 ns  137240576 ns          5 bytes_per_second=555.914Mi/s items_per_second=72.8647M/s
arrow_double_version/1000            8865 ns       8863 ns      80202 bytes_per_second=1.68125Gi/s items_per_second=112.826M/s
arrow_double_version/10000         131861 ns     131836 ns       5296 bytes_per_second=1.13028Gi/s items_per_second=75.8519M/s
arrow_double_version/100000       1429648 ns    1429675 ns        483 bytes_per_second=1.04228Gi/s items_per_second=69.946M/s
arrow_double_version/1000000     15052944 ns   15047883 ns         47 bytes_per_second=1014.02Mi/s items_per_second=66.4545M/s
arrow_double_version/10000000   150475448 ns  150468935 ns          4 bytes_per_second=1014.08Mi/s items_per_second=66.4589M/s
Comparing my.* to arrow.* (from ../my_version/Benchmark_ULP)
Benchmark                         Time        CPU    Time Old    Time New     CPU Old     CPU New
-----------------------------------------------------------------------------------------------------------------
[my.* vs. arrow.*]             +3.1050    +3.1042        1996        8195        1996        8194
[my.* vs. arrow.*]             +4.1066    +4.1059       24942      127369       24941      127346
[my.* vs. arrow.*]             +2.2827    +2.2818      411094     1349496      411095     1349150
[my.* vs. arrow.*]             +2.2451    +2.2445     4284970    13905260     4284597    13901427
[my.* vs. arrow.*]             +2.0906    +2.0921    44405282   137240593    44384203   137240576
[my.* vs. arrow.*]             +3.3472    +3.3466        2039        8865        2039        8863
[my.* vs. arrow.*]             +3.9530    +3.9523       26622      131861       26621      131836
[my.* vs. arrow.*]             +2.2179    +2.2184      444275     1429648      444225     1429675
[my.* vs. arrow.*]             +2.2670    +2.2667     4607614    15052944     4606397    15047883
[my.* vs. arrow.*]             +2.1974    +2.1973    47061217   150475448    47061698   150468935
[my.* vs. arrow.*]_pvalue       0.4727     0.4727    U Test, Repetitions: 10 vs 10
OVERALL_GEOMEAN                +2.7141    +2.7139           0           0           0           0
```
The results with GCC 13.3:
```
RUNNING: ../my_version/Benchmark_ULP --benchmark_filter=my.* --benchmark_out=/tmp/tmpripsdal3
2025-11-21T10:42:42+03:30
Running ../my_version/Benchmark_ULP
Run on (12 X 2026.34 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 1280 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.82, 0.81, 1.02
------------------------------------------------------------------------------------
Benchmark                        Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------
my_float_version/1000            1874 ns       1874 ns     361661 bytes_per_second=3.97566Gi/s items_per_second=533.605M/s
my_float_version/10000          67258 ns      67253 ns      10104 bytes_per_second=1.10784Gi/s items_per_second=148.692M/s
my_float_version/100000        832701 ns     832614 ns        790 bytes_per_second=916.319Mi/s items_per_second=120.104M/s
my_float_version/1000000      8801636 ns    8799939 ns         82 bytes_per_second=866.983Mi/s items_per_second=113.637M/s
my_float_version/10000000    87679986 ns   87640110 ns          8 bytes_per_second=870.537Mi/s items_per_second=114.103M/s
my_double_verion/1000            2061 ns       2060 ns     337844 bytes_per_second=7.23212Gi/s items_per_second=485.339M/s
my_double_verion/10000          57709 ns      57672 ns      11730 bytes_per_second=2.58376Gi/s items_per_second=173.393M/s
my_double_verion/100000        829816 ns     829136 ns        853 bytes_per_second=1.79719Gi/s items_per_second=120.607M/s
my_double_verion/1000000      8788251 ns    8783822 ns         80 bytes_per_second=1.69643Gi/s items_per_second=113.846M/s
my_double_verion/10000000    87873772 ns   87843363 ns          8 bytes_per_second=1.69633Gi/s items_per_second=113.839M/s
RUNNING: ../my_version/Benchmark_ULP --benchmark_filter=arrow.* --benchmark_out=/tmp/tmpb6xqe856
2025-11-21T10:42:52+03:30
Running ../my_version/Benchmark_ULP
Run on (12 X 2198.99 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 1280 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.93, 0.83, 1.03
----------------------------------------------------------------------------------------
Benchmark                            Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------------------
arrow_float_version/1000             8104 ns       8104 ns      83413 bytes_per_second=941.477Mi/s items_per_second=123.401M/s
arrow_float_version/10000          163596 ns     163530 ns       4382 bytes_per_second=466.543Mi/s items_per_second=61.1507M/s
arrow_float_version/100000        1694072 ns    1693670 ns        407 bytes_per_second=450.465Mi/s items_per_second=59.0434M/s
arrow_float_version/1000000      17004234 ns   17003336 ns         41 bytes_per_second=448.7Mi/s items_per_second=58.812M/s
arrow_float_version/10000000    169239383 ns  169224726 ns          4 bytes_per_second=450.844Mi/s items_per_second=59.093M/s
arrow_double_version/1000            9198 ns       9195 ns      75753 bytes_per_second=1.62055Gi/s items_per_second=108.753M/s
arrow_double_version/10000         156480 ns     156431 ns       4554 bytes_per_second=975.434Mi/s items_per_second=63.926M/s
arrow_double_version/100000       1686394 ns    1685713 ns        419 bytes_per_second=905.183Mi/s items_per_second=59.3221M/s
arrow_double_version/1000000     17405253 ns   17403154 ns         40 bytes_per_second=876.783Mi/s items_per_second=57.4608M/s
arrow_double_version/10000000   175229657 ns  175218073 ns          4 bytes_per_second=870.846Mi/s items_per_second=57.0717M/s
Comparing my.* to arrow.* (from ../my_version/Benchmark_ULP)
Benchmark                         Time        CPU    Time Old    Time New     CPU Old     CPU New
-----------------------------------------------------------------------------------------------------------------
[my.* vs. arrow.*]             +3.3242    +3.3241        1874        8104        1874        8104
[my.* vs. arrow.*]             +1.4324    +1.4316       67258      163596       67253      163530
[my.* vs. arrow.*]             +1.0344    +1.0342      832701     1694072      832614     1693670
[my.* vs. arrow.*]             +0.9319    +0.9322     8801636    17004234     8799939    17003336
[my.* vs. arrow.*]             +0.9302    +0.9309    87679986   169239383    87640110   169224726
[my.* vs. arrow.*]             +3.4620    +3.4627        2061        9198        2060        9195
[my.* vs. arrow.*]             +1.7115    +1.7124       57709      156480       57672      156431
[my.* vs. arrow.*]             +1.0323    +1.0331      829816     1686394      829136     1685713
[my.* vs. arrow.*]             +0.9805    +0.9813     8788251    17405253     8783822    17403154
[my.* vs. arrow.*]             +0.9941    +0.9947    87873772   175229657    87843363   175218073
[my.* vs. arrow.*]_pvalue       0.4727     0.4727    U Test, Repetitions: 10 vs 10
OVERALL_GEOMEAN                +1.4486    +1.4490           0           0           0           0
```
3- Considering these factors (the existing implementation contains a bug and
is slower in every run above, with an overall geomean of +2.71 under Clang and
+1.45 under GCC, i.e. about 3.7x and 2.4x the run time, although the slowdown
may not be significant), should we retain the current implementation or adopt
the new one for float16 and enable it in Arrow's equality method?
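
As a check on how to read the comparison rows, the `Time` column appears to be the relative difference `(new - old) / old`, so `+3.1050` means the Arrow version took roughly 4.1x as long. The sketch below recomputes this from the `Time Old` / `Time New` values in the Clang table. It assumes that `OVERALL_GEOMEAN` is the geometric mean of the per-benchmark new/old ratios minus one, which is consistent with the numbers shown; `RelativeDiff` and `GeomeanRelativeDiff` are names made up for this illustration.

```c++
#include <cmath>
#include <cstddef>

// Time Old (my.*) and Time New (arrow.*) in ns, copied from the Clang
// comparison table: five float sizes followed by five double sizes.
constexpr double kOld[] = {1996,     24942,  411094,  4284970, 44405282,
                           2039,     26622,  444275,  4607614, 47061217};
constexpr double kNew[] = {8195,     127369, 1349496, 13905260, 137240593,
                           8865,     131861, 1429648, 15052944, 150475448};
constexpr std::size_t kCount = sizeof(kOld) / sizeof(kOld[0]);

// Relative difference as shown in the Time column: (new - old) / old.
inline double RelativeDiff(double old_time, double new_time) {
  return (new_time - old_time) / old_time;
}

// Geometric mean of the new/old ratios, expressed as a relative difference.
// Assumption: this is how the OVERALL_GEOMEAN row is derived.
inline double GeomeanRelativeDiff() {
  double log_sum = 0.0;
  for (std::size_t i = 0; i < kCount; ++i) {
    log_sum += std::log(kNew[i] / kOld[i]);
  }
  return std::exp(log_sum / static_cast<double>(kCount)) - 1.0;
}
```

Because the table prints rounded times, the recomputed values match the reported ones only to about two decimal places: roughly +2.71 overall for Clang, i.e. about 3.7x the run time.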