[GitHub] [arrow] paleolimbot commented on a diff in pull request #36992: GH-36990: [R] Expose Parquet ReaderProperties

via GitHub Thu, 10 Aug 2023 07:12:25 -0700


paleolimbot commented on code in PR #36992:
URL: https://github.com/apache/arrow/pull/36992#discussion_r1290198337



##########
r/tests/testthat/test-parquet.R:
##########
@@ -484,3 +484,43 @@ test_that("Can read Parquet files from a URL", {
   expect_true(tibble::is_tibble(pu))
   expect_identical(dim(pu), c(10L, 11L))
 })
+
+test_that("thrift string and container size can be specified when reading 
Parquet files", {
+
+  tf <- tempfile()
+  on.exit(unlink(tf))
+  table <- arrow_table(example_data)
+  write_parquet(table, tf)
+  file <- make_readable_file(tf)
+  on.exit(file$close())
+
+  # thrift string size
+  reader_props <- ParquetReaderProperties$create()
+  reader_props$set_thrift_string_size_limit(1)
+  expect_identical(reader_props$thrift_string_size_limit(), 1L)
+
+  # We get an error if we set the Thrift string size limit too small
+  expect_error(ParquetFileReader$create(file, reader_props = reader_props), 
"TProtocolException: Exceeded size limit")
+
+  # Increase the size and we can read successfully
+  reader_props$set_thrift_string_size_limit(10000)
+  reader <- ParquetFileReader$create(file, reader_props = reader_props)
+  data <- reader$ReadTable()
+  expect_identical(collect.ArrowTabular(data), example_data)

Review Comment:
   With respect to C++ messages, there is an issue that you or I could tackle 
to make that slightly better ( https://github.com/apache/arrow/issues/34826 ); 
however, we test parts of error messages in dozens of places and it doesn't 
cause any maintenance effort that I am aware of. Testing the C++ message is 
less about testing C++ behaviour and 100% about making sure we are testing the 
code path that we think we are testing. `expect_error(with_a_tpyo())` still 
passes, as would a change in the underlying code that produces a completely 
unrelated error (e.g., a typo in the thrift setting/getting that results in 
"object not found" of "closure not subsettable"). It is very important that we 
do not swallow those errors and the C++ error messages almost never change 
(when they do, C++ developers have historically been very good at updating our 
tests to fix the failing CI).
   
   > I started thinking we should only test that we can get/set these values, 
but not test the outcomes of running the code. 
   
   I think that is the 100% correct way to go about it! The connection between 
pyarrow and Arrow C++ is closer and it is sometimes easier to write the tests 
in Python...I would not be surprised if many pieces of Arrow C++ relied on test 
coverage from Python. I don't think duplicating those tests in this PR is 
needed (you could open an issue for the tests to be added to Arrow C++...in the 
discussion for that ticket, it may come about that R is a good choice for those 
tests or it may not, or we might not have bandwidth to get there).
   
   > I did have thoughts about splitting out tests into separate files for 
integration tests versus unit tests, but I think this would be more of an 
exercise in philosophical pedantry than adding actual value or making our lives 
easier.
   
   I think this does have value and would make our lives easier; however, I 
don't think either of us have time to do it right now. To a certain extent we 
already do this via benchmarks and the extra-tests folder, which don't run on 
CRAN but do run on CI (I think).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] paleolimbot commented on a diff in pull request #36992: GH-36990: [R] Expose Parquet ReaderProperties

Reply via email to