nevi-me opened a new pull request #512:
URL: https://github.com/apache/arrow-rs/pull/512


   # Which issue does this PR close?
   
   None. I'm opening this to bank some work that I did while investigating #385.
   
   # Rationale for this change
    
   The parquet writer computes row group stats record-by-record when writing. An alternative is to pass in precomputed stats, which avoids this per-record work.
   
   This would also allow us to pass in the distinct count of records, which seems to be desirable for IOx.
   
   # What changes are included in this PR?
   
   Computes the stats using `arrow::compute` for some column types.
   The PR is incomplete, as I want to solicit feedback first.
   
   This is on top of #511, so it should be reviewed after that PR.
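
To illustrate the idea of computing stats for a whole column at once, here is a minimal std-only sketch. It is not the code in this PR and does not use `arrow::compute` or the parquet crate: `ColumnStats` and `compute_stats` are hypothetical names, `Option<i32>` stands in for a nullable Arrow array slot, and counting only non-null values as distinct is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical summary of one column chunk.
/// This is NOT the parquet crate's statistics type.
#[derive(Debug, PartialEq)]
struct ColumnStats {
    min: Option<i32>,
    max: Option<i32>,
    null_count: u64,
    distinct_count: u64,
}

/// Computes min, max, null count and distinct count in one pass
/// over a nullable column. `None` represents a null slot.
fn compute_stats(values: &[Option<i32>]) -> ColumnStats {
    let mut min: Option<i32> = None;
    let mut max: Option<i32> = None;
    let mut null_count = 0u64;
    // Assumption: only non-null values contribute to the distinct count.
    let mut seen = HashSet::new();

    for value in values {
        match value {
            None => null_count += 1,
            Some(v) => {
                min = Some(min.map_or(*v, |m| m.min(*v)));
                max = Some(max.map_or(*v, |m| m.max(*v)));
                seen.insert(*v);
            }
        }
    }

    ColumnStats {
        min,
        max,
        null_count,
        distinct_count: seen.len() as u64,
    }
}
```

A caller could compute such a summary once per column and hand it to the writer, instead of having the writer update min and max for every record it encodes.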
   
   # Are there any user-facing changes?
   
   No
   
   ___
   
   There are no noticeable performance changes, per:
   
   ```bash
   cargo bench -p parquet --bench arrow_writer
   ```
   
   ```
   write_batch primitive/1024 values
                           time:   [1.5005 ms 1.5055 ms 1.5112 ms]
                           thrpt:  [66.781 MiB/s 67.035 MiB/s 67.261 MiB/s]
                    change:
                           time:   [-0.0027% +0.7862% +1.5464%] (p = 0.05 < 0.05)
                           thrpt:  [-1.5229% -0.7801% +0.0027%]
                           Change within noise threshold.
   Found 4 outliers among 100 measurements (4.00%)
     4 (4.00%) high mild
   write_batch primitive/4096 values
                           time:   [5.2132 ms 5.2259 ms 5.2392 ms]
                           thrpt:  [75.460 MiB/s 75.653 MiB/s 75.838 MiB/s]
                    change:
                           time:   [-1.2864% -0.8463% -0.4253%] (p = 0.00 < 0.05)
                           thrpt:  [+0.4272% +0.8535% +1.3031%]
                           Change within noise threshold.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   ```



