jonkeane commented on a change in pull request #11783:
URL: https://github.com/apache/arrow/pull/11783#discussion_r762070759



##########
File path: r/tests/testthat/test-parquet.R
##########
@@ -311,3 +311,7 @@ test_that("ParquetFileWrite chunk_size defaults", {
       expect_error(reader$ReadRowGroup(2), "Some index in row_group_indices")
     })
 })
+
+test_that("ParquetFileWrite chunk_size calculation doesn't have integer overflow issues (ARROW-14894)", {
+  expect_equal(calculate_chunk_size(31869547, 108, 2.5e8, 200), 2451504)

Review comment:
       ```suggestion
     expect_equal(calculate_chunk_size(31869547, 108, 2.5e8, 200), 2451504)
     
     # we can set the target cells per group, and it rounds appropriately
     expect_equal(calculate_chunk_size(100, 1, 25), 25)
     expect_equal(calculate_chunk_size(101, 1, 25), 26)
     
     # but our max_chunks is respected
     expect_equal(calculate_chunk_size(101, 1, 25, 2), 51)
   ```
   
   These are kinda-sorta the same tests as above, so we don't _need_ to include them, but I find these a bit easier to reason about (and certainly easier to test than constructing a whole dataset!)

##########
File path: r/R/parquet.R
##########
@@ -608,3 +593,28 @@ ParquetArrowReaderProperties <- R6Class("ParquetArrowReaderProperties",
 ParquetArrowReaderProperties$create <- function(use_threads = option_use_threads()) {
   parquet___arrow___ArrowReaderProperties__Make(isTRUE(use_threads))
 }
+
+calculate_chunk_size <- function(rows,  columns,
+  target_cells_per_group = getOption("arrow.parquet_cells_per_group", 2.5e8),
+  max_chunks = getOption("arrow.parquet_max_chunks", 200)
+  ){

Review comment:
       ```suggestion
     ) {
   ```



