paleolimbot commented on code in PR #36992:
URL: https://github.com/apache/arrow/pull/36992#discussion_r1290198337
##########
r/tests/testthat/test-parquet.R:
##########
@@ -484,3 +484,43 @@ test_that("Can read Parquet files from a URL", {
expect_true(tibble::is_tibble(pu))
expect_identical(dim(pu), c(10L, 11L))
})
+
+test_that("thrift string and container size can be specified when reading
Parquet files", {
+
+ tf <- tempfile()
+ on.exit(unlink(tf))
+ table <- arrow_table(example_data)
+ write_parquet(table, tf)
+ file <- make_readable_file(tf)
+ on.exit(file$close())
+
+ # thrift string size
+ reader_props <- ParquetReaderProperties$create()
+ reader_props$set_thrift_string_size_limit(1)
+ expect_identical(reader_props$thrift_string_size_limit(), 1L)
+
+ # We get an error if we set the Thrift string size limit too small
+ expect_error(ParquetFileReader$create(file, reader_props = reader_props),
"TProtocolException: Exceeded size limit")
+
+ # Increase the size and we can read successfully
+ reader_props$set_thrift_string_size_limit(10000)
+ reader <- ParquetFileReader$create(file, reader_props = reader_props)
+ data <- reader$ReadTable()
+ expect_identical(collect.ArrowTabular(data), example_data)
Review Comment:
With respect to C++ messages, there is an issue that you or I could tackle
to make that slightly better ( https://github.com/apache/arrow/issues/34826 );
however, we test parts of error messages in dozens of places and it doesn't
cause any maintenance effort that I am aware of. Testing the C++ message is
less about testing C++ behaviour and 100% about making sure we are testing the
code path that we think we are testing. `expect_error(with_a_tpyo())` still
passes, as would a change in the underlying code that produces a completely
unrelated error (e.g., a typo in the thrift setting/getting that results in
"object not found" of "closure not subsettable"). It is very important that we
do not swallow those errors and the C++ error messages almost never change
(when they do, C++ developers have historically been very good at updating our
tests to fix the failing CI).
> I started thinking we should only test that we can get/set these values,
but not test the outcomes of running the code.
I think that is the 100% correct way to go about it! The connection between
pyarrow and Arrow C++ is closer and it is sometimes easier to write the tests
in Python...I would not be surprised if many pieces of Arrow C++ relied on test
coverage from Python. I don't think duplicating those tests in this PR is
needed (you could open an issue for the tests to be added to Arrow C++...in the
discussion for that ticket, it may come about that R is a good choice for those
tests or it may not, or we might not have bandwidth to get there).
> I did have thoughts about splitting out tests into separate files for
integration tests versus unit tests, but I think this would be more of an
exercise in philosophical pedantry than adding actual value or making our lives
easier.
I think this does have value and would make our lives easier; however, I
don't think either of us have time to do it right now. To a certain extent we
already do this via benchmarks and the extra-tests folder, which don't run on
CRAN but do run on CI (I think).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]