llama90 commented on issue #38569:
URL: https://github.com/apache/arrow/issues/38569#issuecomment-1801998369
Thank you for your response. cc
#### For string
> BTW, why did you compare AsciiLower/AsciiUpper performance with multiple
> inputs? They are Arrow's compute kernels not Gandiva's functions. We are
> working on Gandiva's benchmark not Arrow's compute kernels, right?

You are right. We are working on Gandiva's benchmark.
In **summary**, it appears that:
* We should focus on `Allocs` rather than `upper`.
* The mix of uppercase and lowercase characters in string generation is
_**not**_ significant.
* Previously, strings of lowercase characters of varying lengths from 1 to
64 were generated, but for benchmark consistency, we now maintain a length of
64 when generating strings.
<details><summary>previously generated string data</summary>
```
[
"bcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrs",
"tuvwxyzabc",
"defghijklm",
"nopqrstuvwxyzabcdefghijklmnopqrstuvwxy",
"zabcdefghijklmnopqrstuvwx",
"yzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc",
"defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghij",
"klmnopqrstu",
"vwxyzabcdefghijklmnopqrstuvwxy",
"z",
...
"ijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrs",
"tuvwxyzabcdefgh",
"ijklmnopqrstuvwxyzabcdefghijklmnopqrs",
"tuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopq",
"rstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopq",
"rstuvwxyzabcdefghijklmn",
"opqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklm",
"nopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqr",
"stuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrst",
"uvwxyzabcdefghijklmnopqrstuvwxyzabc"
]
```
</details>
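As a rough illustration of the fixed-length approach described in the summary, a generator could look like the sketch below. `MakeFixedLengthLowercase` and its parameters are hypothetical names, not the actual Gandiva test helper, and the sketch draws letters uniformly at random instead of reproducing the cycling alphabet pattern of the old data.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Gandiva or Arrow): builds `count` strings,
// each exactly `length` characters long, with every character drawn from
// the lowercase range 'a'..'z'.
std::vector<std::string> MakeFixedLengthLowercase(std::size_t count,
                                                  std::size_t length,
                                                  std::uint32_t seed) {
  std::mt19937 rng(seed);                           // fixed seed keeps runs repeatable
  std::uniform_int_distribution<int> pick(0, 25);   // offset into 'a'..'z'
  std::vector<std::string> out;
  out.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    std::string s(length, 'a');
    for (char& c : s) {
      c = static_cast<char>('a' + pick(rng));
    }
    out.push_back(std::move(s));
  }
  return out;
}
```

Calling `MakeFixedLengthLowercase(n, 64, seed)` would give every benchmark run inputs of identical length, so timing differences are not caused by varying string sizes.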
To briefly explain why we ran the test: the previous benchmarks generated
data from characters in the `a`-`z` range, whereas `random.h` selects
characters from the `A`-`z` range. The difference in benchmark times between
the two made me question the cause.
```diff
# previous
- TimedTestAllocs/min_time:1.000              140487 us  137767 us  10
- TimedTestOutputStringAllocs/min_time:1.000  228228 us  226211 us  6
# random.h
+ TimedTestAllocs/min_time:1.000              243130 us  242760 us  6
+ TimedTestOutputStringAllocs/min_time:1.000  332357 us  331799 us  4
```
Upon further inspection, particularly of the
`arrow-compute-scalar-string-benchmark`, I noticed that `AsciiLower` and
`AsciiUpper` perform differently depending on the range of characters
in the string, which is why I mentioned the `upper` aspect.
My understanding of Gandiva is limited, but I believe that mixing
`uppercase` and `lowercase` could have a similar impact.
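To make the difference between the two character ranges concrete, the sketch below counts what each range contains in ASCII. `RangeMix` and `ClassifyRange` are hypothetical names used only for illustration. In ASCII, `A`-`z` spans 58 code points: 26 uppercase letters, 26 lowercase letters, and the 6 punctuation characters between `Z` and `a`.

```cpp
#include <cassert>

// Hypothetical helper types for illustration only (not Arrow or Gandiva code).
struct RangeMix {
  int upper = 0;  // 'A'..'Z'
  int lower = 0;  // 'a'..'z'
  int other = 0;  // anything else, e.g. '[', '\\', ']', '^', '_', '`'
};

// Classifies every ASCII code point in the inclusive range [lo, hi].
RangeMix ClassifyRange(char lo, char hi) {
  assert(lo <= hi);
  RangeMix mix;
  for (int c = lo; c <= hi; ++c) {
    if (c >= 'A' && c <= 'Z') {
      ++mix.upper;
    } else if (c >= 'a' && c <= 'z') {
      ++mix.lower;
    } else {
      ++mix.other;
    }
  }
  return mix;
}
```

With `ClassifyRange('a', 'z')` every character is lowercase, while `ClassifyRange('A', 'z')` yields a mix in which fewer than half of the characters are lowercase, so any kernel whose cost depends on letter case sees quite different input.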
#### For decimal
In this case, I think you should check the `precision` and `scale` of the
Decimal Type first. Was I looking at the wrong thing? I had `scale` and
`precision` reversed.
The existing decimal generation code is implemented like this:
https://github.com/apache/arrow/blob/e62ec62e40b04b0bfce76d58369845d3aa96a419/cpp/src/gandiva/tests/generate_data.h#L82-L98
Anyway, I'm having a hard time figuring out how to fit this in because for
`random.h`, the `precision` and `scale` values are the same as before, but the
random values generated are different.
<details><summary>previously generated DecimalAdd2LeadingZeroes data</summary>
```
[
16351795034091788378275.759672,
27854373977842001526319.734684,
6061446293668872071695.356134,
31215470323596391714672.813452,
15533332405516061115471.066480,
15401288582294600038721.016490,
30274851179048759105303.625351,
6564021121342245976114.232879,
36171304665527520454124.155732,
3546578283628563437121.241384,
...
14860910085674466129132.927835,
6131533904566936343671.748478,
36528723632058120161272.517328,
3964543116664404486145.446787,
20606324915198138790682.388805,
20631004666520402209109.905293,
29878891319444493588791.347655,
13823654056734867583462.044256,
30786726283235031078880.608496,
620129083913073530975.895978
]
```
</details>
<details><summary>`random.h` generated DecimalAdd2LeadingZeroes data</summary>
```
[
-10038291425340000002078631.818296,
20434255526017535581463554076021.370910,
45129537519062609815007557527197.169527,
-62154291670684288853156612735793.477627,
87103160282621257590199953623276.740180,
79969063163511904595223170720469.422729,
-86583636649126666385105002765466.142621,
92352037140996988957666339677514.598197,
25931328664987260071252581126967.305358,
91215577756571866841817287196691.182509,
...
52281930080182271842503590963377.447177,
4372942953952971927671269281255.305966,
96763102312104951696190306530776.653371,
-18467695819003138567360094105422.436093,
-9792209172207529275819289504154.021690,
61852766762170189161854285754108.616433,
25571897876364008995405425598827.463597,
64701056156004809966065180476482.250553,
-63257829389611554338056688929518.067776,
67455720423980831654534847333775.750429
]
```
</details>
<details><summary>previously generated DecimalAdd2LeadingZeroesWithDiv data</summary>
```
[
7587380203.400195060290629070,
26540763051.629767310192345708,
7223529576.634111708006082596,
6365242807.233231030439853103,
30019804134.302409408324050740,
25340645699.884287035225737289,
10696019445.722928448197246826,
949466737.960961850748143665,
10100302893.597466582610415885,
3997196768.260434038917129614,
...
32227537378.843572088950643055,
8537777136.877817787746240968,
30551609561.954519025705399073,
26680430508.348522923995981443,
33569096800.981186984459717330,
3351111436.781940463508363016,
24558127179.325481340697631536,
22734710467.783675740496138638,
20732616951.383696862539113282,
23868794600.064308290332457814
]
```
</details>
<details><summary>`random.h` generated DecimalAdd2LeadingZeroesWithDiv data</summary>
```
[
14208508617760.000001420629929750,
34494975892204083434.054431421823503934,
-59808716875461356024.747039105565396418,
-39719453848714764419.682550443910720831,
-87317155083825605821.703637655099976619,
-65044845346732651787.465211119366632301,
-88009445715660277207.144934805781121915,
95937137753543307790.043195390657972343,
20590051896038797936.563280809743603093,
49223527283939584300.218543938584200541,
...
19342162338087482443.922956656008521477,
-48849509529657253769.022422810864683008,
-22694641636271599896.031597585758119284,
-40556169347524762115.197985730587632297,
12188092371682991218.322898910856250055,
85478353248271990406.317595833539039857,
75882328331451585423.885909771843576864,
53687092121591423454.148705140146222481,
-44309181005291397260.084537547533448259,
-95059575541611513750.967476857639700305
]
```
</details>
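One way to compare the dumps above is to count digits on each side of the decimal point. The sketch below is only an illustration: `DigitCounts` and `CountDigits` are hypothetical names that operate on the printed text of a value, not on Arrow's `Decimal128` type.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only: digit counts taken from the
// printed form of a decimal value such as "-123.4500".
struct DigitCounts {
  std::size_t integer_digits;   // digits before the decimal point
  std::size_t fraction_digits;  // digits after the decimal point (the scale)
};

DigitCounts CountDigits(const std::string& text) {
  // Skip a leading sign if there is one.
  std::size_t start =
      (!text.empty() && (text[0] == '-' || text[0] == '+')) ? 1 : 0;
  std::size_t dot = text.find('.');
  if (dot == std::string::npos) {
    dot = text.size();  // no fractional part
  }
  DigitCounts counts;
  counts.integer_digits = dot - start;
  counts.fraction_digits = (dot < text.size()) ? text.size() - dot - 1 : 0;
  return counts;
}
```

Applied to the first previously generated `DecimalAdd2LeadingZeroes` value, `16351795034091788378275.759672`, this gives 23 integer digits and 6 fractional digits. The `random.h` value `20434255526017535581463554076021.370910` gives 32 integer digits and 6 fractional digits. The scale matches, but the `random.h` values are much larger in magnitude, which is consistent with the observation that the type parameters are unchanged while the generated values differ.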