raulcd commented on PR #88:
URL: https://github.com/apache/parquet-testing/pull/88#issuecomment-3048249553
Thanks @wgtmac
I've added two more columns to cover the `is_{min/max}_value_exact=true` case.
I've also added more information to the README, because the file was growing
in complexity and I thought it was worth documenting. Let me know if you find
it too verbose; I can remove the parquet-cli output. The summary is:
| column_name               | min      | is_min_value_exact | max             | is_max_value_exact |
| ------------------------- | -------- | ------------------ | --------------- | ------------------ |
| utf8_full_truncation      | "Al"     | false              | "Kf"            | false              |
| binary_full_truncation    | "0x416C" | false              | "0x4B66"        | false              |
| utf8_partial_truncation   | "Al"     | false              | "🚀Kevin Bacon" | true               |
| binary_partial_truncation | "0x416C" | false              | "0xFFFF0102"    | true               |
| utf8_no_truncation        | "Al"     | true               | "Ke"            | true               |
| binary_no_truncation      | "0x416C" | true               | "0x4B65"        | true               |
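To illustrate why the binary rows above end up with those values and flags, here is a minimal sketch of byte-wise truncation to 2 bytes. It is not the writer's actual implementation: `TruncatedStat`, `TruncateMin`, `TruncateMax` and the sample input `"Alice"` are hypothetical names and values chosen for illustration, and the sketch assumes unsigned byte-wise ordering. It does not handle the UTF-8 case, where truncation also has to respect character boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type: the stored statistic plus its exactness flag.
struct TruncatedStat {
  std::string value;
  bool is_exact;
};

// Lower bound: a prefix of the minimum always sorts <= the minimum,
// so cutting it to max_len bytes is enough.
TruncatedStat TruncateMin(const std::string& min, std::size_t max_len) {
  if (min.size() <= max_len) return {min, true};
  return {min.substr(0, max_len), false};
}

// Upper bound: take the prefix, then increment the last byte that is not
// 0xFF (dropping any trailing 0xFF bytes) so the result sorts above the
// original value.
TruncatedStat TruncateMax(const std::string& max, std::size_t max_len) {
  if (max.size() <= max_len) return {max, true};
  std::string prefix = max.substr(0, max_len);
  while (!prefix.empty()) {
    unsigned char last = static_cast<unsigned char>(prefix.back());
    if (last != 0xFF) {
      prefix.back() = static_cast<char>(last + 1);
      return {prefix, false};
    }
    prefix.pop_back();
  }
  // Every prefix byte was 0xFF: no shorter upper bound exists,
  // so the full value is kept and stays exact.
  return {max, true};
}
```

Under these assumptions, `TruncateMax("Kevin Bacon", 2)` gives `"Kf"` with the flag set to false, while a value starting with `0xFFFF` cannot be shortened and is stored whole with the flag set to true.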
As for how to test it, since parquet-cli doesn't print these flags yet: I agree
we should add them to `ParquetFilePrinter`. I can add it on my PR, but that PR
already uses this file and validates those fields and flags, see:
- https://github.com/apache/arrow/pull/46992
The main gist of the test validates that the file is correct and that
`is_{min/max}_value_exact` are true/false when they should be:
```
ASSERT_EQ(6, metadata->num_columns());
for (int num_column = 0; num_column < metadata->num_columns(); ++num_column) {
  auto column_chunk = metadata->ColumnChunk(num_column);
  ASSERT_TRUE(column_chunk->is_stats_set());
  std::shared_ptr<EncodedStatistics> encoded_statistics =
      column_chunk->encoded_statistics();
  ASSERT_TRUE(encoded_statistics != nullptr);
  ASSERT_EQ(0, encoded_statistics->null_count);
  // Every column shares the same (possibly truncated) 2-byte minimum
  EXPECT_EQ("Al", encoded_statistics->min());
  ASSERT_EQ(true, encoded_statistics->has_is_max_value_exact);
  ASSERT_EQ(true, encoded_statistics->has_is_min_value_exact);
  switch (num_column) {
    case 2:
      // Max couldn't truncate the utf-8 string longer than 2 bytes
      EXPECT_EQ("🚀Kevin Bacon", encoded_statistics->max());
      ASSERT_EQ(true, encoded_statistics->is_max_value_exact);
      ASSERT_EQ(false, encoded_statistics->is_min_value_exact);
      break;
    case 3:
      // Max couldn't truncate 0xFFFF binary string
      EXPECT_EQ("\xFF\xFF\x1\x2", encoded_statistics->max());
      ASSERT_EQ(true, encoded_statistics->is_max_value_exact);
      ASSERT_EQ(false, encoded_statistics->is_min_value_exact);
      break;
    case 4:
    case 5:
      // Min and Max are not truncated, fit on 2 bytes
      EXPECT_EQ("Kf", encoded_statistics->max());
      ASSERT_EQ(true, encoded_statistics->is_max_value_exact);
      ASSERT_EQ(true, encoded_statistics->is_min_value_exact);
      break;
    default:
      // Max truncated to 2 bytes on columns 0 and 1
      EXPECT_EQ("Kf", encoded_statistics->max());
      ASSERT_EQ(false, encoded_statistics->is_max_value_exact);
      ASSERT_EQ(false, encoded_statistics->is_min_value_exact);
  }
}
```
So it's a chicken-and-egg problem :). I would merge this (once we are happy
with the file), and then I can work on the C++ PR to add the flags to
`EncodedStatistics` and `ParquetFilePrinter`.
Let me know if you want me to follow a different approach.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]