westonpace commented on code in PR #15179:
URL: https://github.com/apache/arrow/pull/15179#discussion_r1061928948
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -4138,6 +4139,74 @@ TEST_P(TestArrowWriteDictionary, Statistics) {
INSTANTIATE_TEST_SUITE_P(WriteDictionary, TestArrowWriteDictionary,
::testing::Values(ParquetDataPageVersion::V1,
ParquetDataPageVersion::V2));
+
+TEST_P(TestArrowWriteDictionary, StatisticsUnifiedDictionary) {
+ // Two chunks, with a shared dictionary
+ std::shared_ptr<::arrow::Table> table;
+ std::shared_ptr<::arrow::DataType> dict_type =
+ ::arrow::dictionary(::arrow::int32(), ::arrow::utf8());
+ std::shared_ptr<::arrow::Schema> schema =
+ ::arrow::schema({::arrow::field("values", dict_type)});
+ {
+ // It's important there are no duplicate values in the dictionary, otherwise
+ // we trigger the WriteDense() code path which side-steps dictionary encoding.
+ std::shared_ptr<::arrow::Array> test_dictionary =
+ ArrayFromJSON(::arrow::utf8(), R"(["b", "c", "d", "a"])");
+ std::vector<std::shared_ptr<::arrow::Array>> test_indices = {
+ ArrayFromJSON(::arrow::int32(),
+ R"([0, null, 3, 0, null, 3])"), // ["b", null "a", "b", null, "a"]
+ ArrayFromJSON(
+ ::arrow::int32(),
+ R"([0, 3, null, 0, null, 1])")}; // ["b", "c", null, "b", "c", null]
Review Comment:
Right, so I think (but could be wrong) this would lead to three calls to
`WriteArrowDictionary`:
Call #1: (no previous dictionary) min=a, max=b, nulls=2
Call #2: (previous dictionary is equal) min=a, max=b, nulls=1
Call #3: (no previous dictionary) min=b, max=c, nulls=1
So if the bug was still in place, and it was using the first chunk to
determine row-group statistics, it would still get the correct answer in this
case.
Admittedly, the null count would still be wrong (it would report 2 nulls for
stat0), so the test case itself wouldn't pass with the old code. But I think
it would get further than it should.
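
The per-call numbers above can be reproduced with a small standalone sketch. It assumes the 12 rows reach the writer as slices of 6, 3, and 3 rows (inferred from the three calls listed above, not confirmed against the writer code), and it decodes the second index array directly against the dictionary, which gives ["b", "a", null, "b", null, "c"]. `Stats`, `ComputeStats`, `Merge`, and the `k...` constants are hypothetical helpers for illustration only, not Parquet or Arrow APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for column statistics; not the Parquet Statistics API.
struct Stats {
  std::string min;
  std::string max;
  int null_count = 0;
  bool has_min_max = false;
};

using Indices = std::vector<std::optional<int>>;

// Test data copied from the diff: one shared dictionary, two index chunks.
const std::vector<std::string> kDictionary = {"b", "c", "d", "a"};
const Indices kChunk0 = {0, std::nullopt, 3, 0, std::nullopt, 3};
const Indices kChunk1 = {0, 3, std::nullopt, 0, std::nullopt, 1};

// Computes min/max/null_count over indices[offset, offset + length),
// resolving each non-null index through the dictionary.
Stats ComputeStats(const std::vector<std::string>& dictionary,
                   const Indices& indices, std::size_t offset,
                   std::size_t length) {
  assert(offset + length <= indices.size());
  Stats stats;
  for (std::size_t i = offset; i < offset + length; ++i) {
    if (!indices[i].has_value()) {
      ++stats.null_count;
      continue;
    }
    const std::string& value =
        dictionary[static_cast<std::size_t>(*indices[i])];
    if (!stats.has_min_max) {
      stats.min = value;
      stats.max = value;
      stats.has_min_max = true;
    } else {
      if (value < stats.min) stats.min = value;
      if (value > stats.max) stats.max = value;
    }
  }
  return stats;
}

// Combines the statistics of two slices that belong to the same row group.
Stats Merge(const Stats& a, const Stats& b) {
  Stats merged = a;
  merged.null_count += b.null_count;
  if (!b.has_min_max) return merged;
  if (!merged.has_min_max) {
    merged.min = b.min;
    merged.max = b.max;
    merged.has_min_max = true;
    return merged;
  }
  if (b.min < merged.min) merged.min = b.min;
  if (b.max > merged.max) merged.max = b.max;
  return merged;
}
```

Under those assumptions, `ComputeStats(kDictionary, kChunk0, 0, 6)` and `Merge` of the first two slices agree on min and max and differ only in null count (2 versus 3), so statistics taken from the first slice alone would be caught only by the null-count check.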