wgtmac commented on code in PR #40957:
URL: https://github.com/apache/arrow/pull/40957#discussion_r1579012750
##########
cpp/src/parquet/encoding.cc:
##########
@@ -4055,4 +4055,57 @@ std::unique_ptr<Decoder> MakeDictDecoder(Type::type
type_num,
}
} // namespace detail
+
+bool IsParquetVersionAtLeast2_0(ParquetVersion::type parquet_version) {
Review Comment:
Should we refactor this to `bool
IsParquetVersionAtLeast(ParquetVersion::type parquet_version,
ParquetVersion::type target_version)`? BYTE_STREAM_SPLIT was introduced in
PARQUET_2_8 and will be extended in PARQUET_2_11. We need a way to check
minimum version of an encoding.
##########
cpp/src/parquet/encoding.cc:
##########
@@ -4055,4 +4055,57 @@ std::unique_ptr<Decoder> MakeDictDecoder(Type::type
type_num,
}
} // namespace detail
+
+bool IsParquetVersionAtLeast2_0(ParquetVersion::type parquet_version) {
+ switch (parquet_version) {
+ case ParquetVersion::PARQUET_1_0:
+ return false;
+ case ParquetVersion::PARQUET_2_4:
+ case ParquetVersion::PARQUET_2_6:
+ return true;
+ default:
+ return parquet_version > ParquetVersion::PARQUET_1_0;
+ }
+ return false;
+}
+
+// Informed heavily by
https://github.com/apache/parquet-format/blob/master/Encodings.md
+// and
+//
https://github.com/apache/arrow-rs/blob/a15adf66a8212fdb3a0222a32130ce70cfc28cf5/parquet/src/column/writer/mod.rs#L1183
+Encoding::type ChooseNonDictEncoding(Type::type data_type,
Review Comment:
We may also need input logical type here because FIXED_LEN_BYTE_ARRAY with
logical type FLOAT16|DECIMAL should support BYTE_STREAM_SPLIT in PARQUET_2_11.
cc @pitrou
##########
cpp/src/parquet/encoding.cc:
##########
@@ -4055,4 +4055,57 @@ std::unique_ptr<Decoder> MakeDictDecoder(Type::type
type_num,
}
} // namespace detail
+
+bool IsParquetVersionAtLeast2_0(ParquetVersion::type parquet_version) {
+ switch (parquet_version) {
+ case ParquetVersion::PARQUET_1_0:
+ return false;
+ case ParquetVersion::PARQUET_2_4:
+ case ParquetVersion::PARQUET_2_6:
+ return true;
+ default:
+ return parquet_version > ParquetVersion::PARQUET_1_0;
+ }
+ return false;
+}
+
+// Informed heavily by
https://github.com/apache/parquet-format/blob/master/Encodings.md
+// and
+//
https://github.com/apache/arrow-rs/blob/a15adf66a8212fdb3a0222a32130ce70cfc28cf5/parquet/src/column/writer/mod.rs#L1183
Review Comment:
BTW, it seems intuitive to fall back to BYTE_STREAM_SPLIT for FLOAT/DOUBLE.
arrow-rs does not implement BYTE_STREAM_SPLIT so it does not appear there
(CMIW).
##########
cpp/src/parquet/encoding.h:
##########
@@ -466,4 +467,9 @@ std::unique_ptr<typename EncodingTraits<DType>::Decoder>
MakeTypedDecoder(
return std::unique_ptr<OutType>(dynamic_cast<OutType*>(base.release()));
}
+PARQUET_EXPORT
Review Comment:
If the purpose of adding it to header file is for testing, I'd suggest
adding a `encoding_internal.h` instead. Then we don't need to include
parquet/properties.h as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]