westonpace commented on code in PR #15179:
URL: https://github.com/apache/arrow/pull/15179#discussion_r1061885341
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -4138,6 +4139,74 @@ TEST_P(TestArrowWriteDictionary, Statistics) {
INSTANTIATE_TEST_SUITE_P(WriteDictionary, TestArrowWriteDictionary,
::testing::Values(ParquetDataPageVersion::V1,
ParquetDataPageVersion::V2));
+
+TEST_P(TestArrowWriteDictionary, StatisticsUnifiedDictionary) {
+ // Two chunks, with a shared dictionary
+ std::shared_ptr<::arrow::Table> table;
+ std::shared_ptr<::arrow::DataType> dict_type =
+ ::arrow::dictionary(::arrow::int32(), ::arrow::utf8());
+ std::shared_ptr<::arrow::Schema> schema =
+ ::arrow::schema({::arrow::field("values", dict_type)});
+ {
+ // It's important there are no duplicate values in the dictionary, otherwise
+ // we trigger the WriteDense() code path which side-steps dictionary encoding.
+ std::shared_ptr<::arrow::Array> test_dictionary =
+ ArrayFromJSON(::arrow::utf8(), R"(["b", "c", "d", "a"])");
+ std::vector<std::shared_ptr<::arrow::Array>> test_indices = {
+ ArrayFromJSON(::arrow::int32(),
+ R"([0, null, 3, 0, null, 3])"), // ["b", null "a", "b", null, "a"]
+ ArrayFromJSON(
+ ::arrow::int32(),
+ R"([0, 3, null, 0, null, 1])")}; // ["b", "c", null, "b", "c", null]
Review Comment:
```suggestion
R"([0, 1, null, 0, 1, null])")}; // ["b", "c", null, "b", "c", null]
```
I like what you have in the comment because then the min/max of row group 0
/ chunk 0 is different from the min/max of row group 0 / chunk 1. Right now
your indices don't match your comment, and what we actually have is:
// ["b", "a", null, "b", null, "c"]
This leads to a/b being the min/max in stats0, but a/b is also the min/max of
both chunks in stats0, so the test would pass even if only one chunk were
considered. To reproduce, I think we want what you have in the comment, which
would mean chunk 0 is a/b and chunk 1 is b/c, so stats0 should be a/c.
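To make the arithmetic concrete, here is a rough standalone sketch (plain C++17, no Arrow or Parquet calls) that decodes the indices against the dictionary and computes the per-chunk and merged min/max. The names `Index`, `MinMax`, `ChunkMinMax` and `Merge` are hypothetical helpers invented for this illustration, not anything in the codebase. The row-group size is not visible in the quoted diff, so the usage note assumes only the first three rows of the second chunk land in row group 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper types for this sketch; std::nullopt models a null slot.
using Index = std::optional<int>;
using MinMax = std::optional<std::pair<std::string, std::string>>;

// Decode dictionary indices and return the (min, max) of the non-null values,
// or std::nullopt when every slot is null.
MinMax ChunkMinMax(const std::vector<std::string>& dictionary,
                   const std::vector<Index>& indices) {
  MinMax result;
  for (const Index& idx : indices) {
    if (!idx) continue;  // nulls do not contribute to min/max
    assert(*idx >= 0 && static_cast<std::size_t>(*idx) < dictionary.size());
    const std::string& value = dictionary[static_cast<std::size_t>(*idx)];
    if (!result) {
      result = std::make_pair(value, value);
    } else {
      result->first = std::min(result->first, value);
      result->second = std::max(result->second, value);
    }
  }
  return result;
}

// Combine the min/max of two chunks that end up in the same row group.
MinMax Merge(const MinMax& a, const MinMax& b) {
  if (!a) return b;
  if (!b) return a;
  return std::make_pair(std::min(a->first, b->first),
                        std::max(a->second, b->second));
}
```

With the dictionary `["b", "c", "d", "a"]`, chunk 0 (`[0, null, 3, 0, null, 3]`) comes out as a/b. Under the row-group assumption above, the current `[0, 3, null]` prefix of chunk 1 is also a/b, so the merged result stays a/b. The suggested `[0, 1, null]` prefix is b/c, and the merged result becomes a/c, which neither chunk produces alone.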