LuciferYang opened a new pull request, #36571:
URL: https://github.com/apache/spark/pull/36571
### What changes were proposed in this pull request?
This PR adds a `putByteArrays` method to `WritableColumnVector`:
```java
public int putByteArrays(int rowId, int total, byte[] value)
```
This method sets multiple copies of the same `byte[]` in a
`WritableColumnVector`. Because `byte[] value` has a fixed length, the required
memory can be allocated once instead of calling `reserve(int requiredCapacity)`
many times.
The new method applies to `ColumnVectorUtils.populate` for `StringType` and
some `DecimalType` cases, which correspond to filling partition columns in the
vectorized Parquet and ORC readers.
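To illustrate the idea, here is a minimal toy sketch. It is not the code in this PR and not Spark's `WritableColumnVector`: the class `ToyBytesVector`, its fields, the buffer growth policy, and the meaning of the return value (offset of the first copy) are all assumptions made for illustration. It only shows how a batched write can check capacity once where a per-row loop checks it on every row.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy model only (hypothetical): NOT Spark's WritableColumnVector. */
public class ToyBytesVector {
    private byte[] data = new byte[16];   // shared backing buffer for all values
    private int used = 0;                 // number of bytes of `data` in use
    private final int[] offsets;          // start offset of each row's value
    private final int[] lengths;          // length of each row's value
    public int reserveCalls = 0;          // counts capacity checks, for comparison

    public ToyBytesVector(int rows) {
        offsets = new int[rows];
        lengths = new int[rows];
    }

    // Assumed growth policy: at least double the buffer when it is too small.
    private void reserve(int requiredCapacity) {
        reserveCalls++;
        if (requiredCapacity > data.length) {
            data = Arrays.copyOf(data, Math.max(requiredCapacity, data.length * 2));
        }
    }

    /** Per-row write: one capacity check for every row. */
    public int putByteArray(int rowId, byte[] value) {
        reserve(used + value.length);
        int start = used;
        System.arraycopy(value, 0, data, start, value.length);
        offsets[rowId] = start;
        lengths[rowId] = value.length;
        used += value.length;
        return start;
    }

    /** Batched write: the total size is known up front, so check capacity once. */
    public int putByteArrays(int rowId, int total, byte[] value) {
        reserve(used + total * value.length);
        int first = used;
        for (int i = 0; i < total; i++) {
            System.arraycopy(value, 0, data, used, value.length);
            offsets[rowId + i] = used;
            lengths[rowId + i] = value.length;
            used += value.length;
        }
        return first; // assumption: return the offset of the first copy
    }

    public String getString(int rowId) {
        return new String(data, offsets[rowId], lengths[rowId], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int rows = 4096;
        byte[] value = "2022-05-16".getBytes(StandardCharsets.UTF_8);

        ToyBytesVector perRow = new ToyBytesVector(rows);
        for (int i = 0; i < rows; i++) {
            perRow.putByteArray(i, value);
        }

        ToyBytesVector batched = new ToyBytesVector(rows);
        batched.putByteArrays(0, rows, value);

        System.out.println("per-row reserve calls: " + perRow.reserveCalls);
        System.out.println("batched reserve calls: " + batched.reserveCalls);
    }
}
```

In this toy model both vectors end up with identical contents; only the number of capacity checks (and the buffer regrowths they can trigger) differs.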
### Why are the changes needed?
Reduce the number of `reserve(int requiredCapacity)` calls, and therefore the
number of memory allocations, when setting multiple copies of a fixed-length
`byte[]` in a `WritableColumnVector`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Pass GitHub Actions (GA).
- Add a `StringType` partition column test scenario to
`DataSourceReadBenchmark`.
**Before**
```
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Partitioned Table(Partition Column Type = StringType):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------------
Data column - CSV                                               19487         19517         42        0.8       1239.0      1.0X
Data column - Json                                              12943         12948          7        1.2        822.9      1.5X
Data column - Parquet Vectorized: DataPageV1                      219           224          6       72.0         13.9     89.2X
Data column - Parquet Vectorized: DataPageV2                      494           501          7       31.9         31.4     39.5X
Data column - Parquet MR: DataPageV1                             2515          2521          9        6.3        159.9      7.7X
Data column - Parquet MR: DataPageV2                             2327          2337         14        6.8        148.0      8.4X
Data column - ORC Vectorized                                      303           306          2       51.9         19.3     64.3X
Data column - ORC MR                                             2126          2130          6        7.4        135.2      9.2X
Partition column - CSV                                           6480          6482          2        2.4        412.0      3.0X
Partition column - Json                                         10564         10572         11        1.5        671.6      1.8X
Partition column - Parquet Vectorized: DataPageV1                  53            58         16      296.6          3.4    367.5X
Partition column - Parquet Vectorized: DataPageV2                  52            57         10      303.6          3.3    376.1X
Partition column - Parquet MR: DataPageV1                        1231          1232          2       12.8         78.3     15.8X
Partition column - Parquet MR: DataPageV2                        1227          1229          3       12.8         78.0     15.9X
Partition column - ORC Vectorized                                  52            57          8      300.3          3.3    372.0X
Partition column - ORC MR                                        1334          1343         11       11.8         84.8     14.6X
Both columns - CSV                                              19608         19626         25        0.8       1246.6      1.0X
Both columns - Json                                             13003         13018         22        1.2        826.7      1.5X
Both columns - Parquet Vectorized: DataPageV1                     262           269          7       60.1         16.6     74.4X
Both columns - Parquet Vectorized: DataPageV2                     538           541          6       29.3         34.2     36.3X
Both columns - Parquet MR: DataPageV1                            2569          2570          2        6.1        163.3      7.6X
Both columns - Parquet MR: DataPageV2                            2343          2361         26        6.7        148.9      8.3X
Both columns - ORC Vectorized                                     344           345          1       45.7         21.9     56.6X
Both columns - ORC MR                                            2173          2178          8        7.2        138.1      9.0X
```
**After**
```
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Partitioned Table(Partition Column Type = StringType):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------------
Data column - CSV                                               22546         22554         11        0.7       1433.4      1.0X
Data column - Json                                              13638         13638          0        1.2        867.1      1.7X
Data column - Parquet Vectorized: DataPageV1                      208           214          9       75.7         13.2    108.5X
Data column - Parquet Vectorized: DataPageV2                      488           492          5       32.2         31.0     46.2X
Data column - Parquet MR: DataPageV1                             2625          2631          9        6.0        166.9      8.6X
Data column - Parquet MR: DataPageV2                             2323          2328          8        6.8        147.7      9.7X
Data column - ORC Vectorized                                      296           300          6       53.1         18.8     76.1X
Data column - ORC MR                                             2154          2156          2        7.3        136.9     10.5X
Partition column - CSV                                           6410          6434         34        2.5        407.6      3.5X
Partition column - Json                                         10021         10028         10        1.6        637.1      2.2X
Partition column - Parquet Vectorized: DataPageV1                  51            55         10      306.9          3.3    439.9X
Partition column - Parquet Vectorized: DataPageV2                  51            55          9      308.1          3.2    441.6X
Partition column - Parquet MR: DataPageV1                        1207          1209          2       13.0         76.7     18.7X
Partition column - Parquet MR: DataPageV2                        1222          1237         22       12.9         77.7     18.5X
Partition column - ORC Vectorized                                  52            55          8      304.2          3.3    436.1X
Partition column - ORC MR                                        1310          1310          0       12.0         83.3     17.2X
Both columns - CSV                                              22310         22318         11        0.7       1418.4      1.0X
Both columns - Json                                             13625         13629          5        1.2        866.3      1.7X
Both columns - Parquet Vectorized: DataPageV1                     248           256         13       63.4         15.8     90.9X
Both columns - Parquet Vectorized: DataPageV2                     529           555         50       29.7         33.7     42.6X
Both columns - Parquet MR: DataPageV1                            2634          2641         10        6.0        167.5      8.6X
Both columns - Parquet MR: DataPageV2                            2375          2377          3        6.6        151.0      9.5X
Both columns - ORC Vectorized                                     338           339          1       46.5         21.5     66.6X
Both columns - ORC MR                                            2189          2193          5        7.2        139.2     10.3X
```