[GitHub] [spark] LuciferYang opened a new pull request, #36571: [SPARK-39202][SQL] Introduce a `putByteArrays` method to `WritableColumnVector` to support setting multiple duplicate `byte[]`

GitBox Mon, 16 May 2022 20:15:42 -0700


LuciferYang opened a new pull request, #36571:
URL: https://github.com/apache/spark/pull/36571


   ### What changes were proposed in this pull request?
   This PR adds a `putByteArrays` method to `WritableColumnVector`:
   
   ```java
   public int putByteArrays(int rowId, int total, byte[] value) 
   ```
   
   This method writes the same `byte[]` value to multiple rows of a `WritableColumnVector`. Because `value` has a fixed length, the required memory can be allocated in one step instead of calling `reserve(int requiredCapacity)` many times.
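
   To illustrate the idea, here is a minimal, self-contained sketch. `ToyBinaryVector` is a hypothetical stand-in written for this example, not Spark's `WritableColumnVector`; its buffer layout, growth policy, and return values are assumptions. It only shows why writing `total` copies of a fixed-length value needs a single `reserve` call, while a per-row loop needs one call per row.

```java
import java.util.Arrays;

// Hypothetical toy vector for illustration only; NOT Spark's WritableColumnVector.
class ToyBinaryVector {
    private byte[] data = new byte[16]; // backing buffer shared by all rows
    private int used = 0;               // number of bytes written so far
    private final int[] offsets;        // start offset of each row's value
    private final int[] lengths;        // length of each row's value
    int reserveCalls = 0;               // counts reserve() calls for the demo

    ToyBinaryVector(int rows) {
        offsets = new int[rows];
        lengths = new int[rows];
    }

    // Ensure the backing buffer can hold at least requiredCapacity bytes.
    void reserve(int requiredCapacity) {
        reserveCalls++;
        if (requiredCapacity > data.length) {
            data = Arrays.copyOf(data, Math.max(requiredCapacity, data.length * 2));
        }
    }

    // Per-row write: one reserve() call for every value.
    int putByteArray(int rowId, byte[] value) {
        reserve(used + value.length);
        return append(rowId, value);
    }

    // Batched write: the total size is known up front, so reserve() runs once.
    // Returning the start offset is a choice made for this sketch.
    int putByteArrays(int rowId, int total, byte[] value) {
        reserve(used + total * value.length);
        int start = used;
        for (int i = 0; i < total; i++) {
            append(rowId + i, value);
        }
        return start;
    }

    // Copy value into the buffer and record where it lives for rowId.
    private int append(int rowId, byte[] value) {
        System.arraycopy(value, 0, data, used, value.length);
        offsets[rowId] = used;
        lengths[rowId] = value.length;
        int start = used;
        used += value.length;
        return start;
    }

    byte[] get(int rowId) {
        return Arrays.copyOfRange(data, offsets[rowId], offsets[rowId] + lengths[rowId]);
    }
}

class PutByteArraysSketch {
    public static void main(String[] args) {
        byte[] partitionValue = {1, 2, 3, 4};
        int rows = 1000;

        ToyBinaryVector perRow = new ToyBinaryVector(rows);
        for (int i = 0; i < rows; i++) {
            perRow.putByteArray(i, partitionValue);
        }

        ToyBinaryVector batched = new ToyBinaryVector(rows);
        batched.putByteArrays(0, rows, partitionValue);

        System.out.println("per-row reserve calls: " + perRow.reserveCalls);
        System.out.println("batched reserve calls: " + batched.reserveCalls);
    }
}
```

   In this toy, both paths leave identical row contents; only the number of `reserve` calls (and therefore buffer reallocations) differs.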
   
   The new method applies to `ColumnVectorUtils.populate` for `StringType` and some `DecimalType` cases, which corresponds to partition column filling in the vectorized Parquet and ORC readers.
   
   
   ### Why are the changes needed?
   To reduce the number of `reserve(int requiredCapacity)` calls, and therefore the number of memory allocations, when setting the same fixed-length `byte[]` multiple times in a `WritableColumnVector`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   - Pass GA
   - Add a `StringType` partition column test scenario in 
`DataSourceReadBenchmark`.
   
   **Before**
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Partitioned Table(Partition Column Type = StringType):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------------
   Data column - CSV                                               19487          19517          42          0.8        1239.0       1.0X
   Data column - Json                                              12943          12948           7          1.2         822.9       1.5X
   Data column - Parquet Vectorized: DataPageV1                      219            224           6         72.0          13.9      89.2X
   Data column - Parquet Vectorized: DataPageV2                      494            501           7         31.9          31.4      39.5X
   Data column - Parquet MR: DataPageV1                             2515           2521           9          6.3         159.9       7.7X
   Data column - Parquet MR: DataPageV2                             2327           2337          14          6.8         148.0       8.4X
   Data column - ORC Vectorized                                      303            306           2         51.9          19.3      64.3X
   Data column - ORC MR                                             2126           2130           6          7.4         135.2       9.2X
   Partition column - CSV                                           6480           6482           2          2.4         412.0       3.0X
   Partition column - Json                                         10564          10572          11          1.5         671.6       1.8X
   Partition column - Parquet Vectorized: DataPageV1                  53             58          16        296.6           3.4     367.5X
   Partition column - Parquet Vectorized: DataPageV2                  52             57          10        303.6           3.3     376.1X
   Partition column - Parquet MR: DataPageV1                        1231           1232           2         12.8          78.3      15.8X
   Partition column - Parquet MR: DataPageV2                        1227           1229           3         12.8          78.0      15.9X
   Partition column - ORC Vectorized                                  52             57           8        300.3           3.3     372.0X
   Partition column - ORC MR                                        1334           1343          11         11.8          84.8      14.6X
   Both columns - CSV                                              19608          19626          25          0.8        1246.6       1.0X
   Both columns - Json                                             13003          13018          22          1.2         826.7       1.5X
   Both columns - Parquet Vectorized: DataPageV1                     262            269           7         60.1          16.6      74.4X
   Both columns - Parquet Vectorized: DataPageV2                     538            541           6         29.3          34.2      36.3X
   Both columns - Parquet MR: DataPageV1                            2569           2570           2          6.1         163.3       7.6X
   Both columns - Parquet MR: DataPageV2                            2343           2361          26          6.7         148.9       8.3X
   Both columns - ORC Vectorized                                     344            345           1         45.7          21.9      56.6X
   Both columns - ORC MR                                            2173           2178           8          7.2         138.1       9.0X
   ```
   
   **After**
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Partitioned Table(Partition Column Type = StringType):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------------
   Data column - CSV                                               22546          22554          11          0.7        1433.4       1.0X
   Data column - Json                                              13638          13638           0          1.2         867.1       1.7X
   Data column - Parquet Vectorized: DataPageV1                      208            214           9         75.7          13.2     108.5X
   Data column - Parquet Vectorized: DataPageV2                      488            492           5         32.2          31.0      46.2X
   Data column - Parquet MR: DataPageV1                             2625           2631           9          6.0         166.9       8.6X
   Data column - Parquet MR: DataPageV2                             2323           2328           8          6.8         147.7       9.7X
   Data column - ORC Vectorized                                      296            300           6         53.1          18.8      76.1X
   Data column - ORC MR                                             2154           2156           2          7.3         136.9      10.5X
   Partition column - CSV                                           6410           6434          34          2.5         407.6       3.5X
   Partition column - Json                                         10021          10028          10          1.6         637.1       2.2X
   Partition column - Parquet Vectorized: DataPageV1                  51             55          10        306.9           3.3     439.9X
   Partition column - Parquet Vectorized: DataPageV2                  51             55           9        308.1           3.2     441.6X
   Partition column - Parquet MR: DataPageV1                        1207           1209           2         13.0          76.7      18.7X
   Partition column - Parquet MR: DataPageV2                        1222           1237          22         12.9          77.7      18.5X
   Partition column - ORC Vectorized                                  52             55           8        304.2           3.3     436.1X
   Partition column - ORC MR                                        1310           1310           0         12.0          83.3      17.2X
   Both columns - CSV                                              22310          22318          11          0.7        1418.4       1.0X
   Both columns - Json                                             13625          13629           5          1.2         866.3       1.7X
   Both columns - Parquet Vectorized: DataPageV1                     248            256          13         63.4          15.8      90.9X
   Both columns - Parquet Vectorized: DataPageV2                     529            555          50         29.7          33.7      42.6X
   Both columns - Parquet MR: DataPageV1                            2634           2641          10          6.0         167.5       8.6X
   Both columns - Parquet MR: DataPageV2                            2375           2377           3          6.6         151.0       9.5X
   Both columns - ORC Vectorized                                     338            339           1         46.5          21.5      66.6X
   Both columns - ORC MR                                            2189           2193           5          7.2         139.2      10.3X
   ```
    

