Re: [PR] GH-40636: [C++][Parquet] Improve fallback encoding choice in column writer. [arrow]

via GitHub Thu, 25 Apr 2024 00:36:42 -0700


wgtmac commented on code in PR #40957:
URL: https://github.com/apache/arrow/pull/40957#discussion_r1579012750



##########
cpp/src/parquet/encoding.cc:
##########
@@ -4055,4 +4055,57 @@ std::unique_ptr<Decoder> MakeDictDecoder(Type::type 
type_num,
 }
 
 }  // namespace detail
+
+bool IsParquetVersionAtLeast2_0(ParquetVersion::type parquet_version) {

Review Comment:
   Should we refactor this to `bool 
IsParquetVersionAtLeast(ParquetVersion::type parquet_version, 
ParquetVersion::type target_version)`? BYTE_STREAM_SPLIT was introduced in 
PARQUET_2_8 and will be extended in PARQUET_2_11. We need a way to check 
minimum version of an encoding.



##########
cpp/src/parquet/encoding.cc:
##########
@@ -4055,4 +4055,57 @@ std::unique_ptr<Decoder> MakeDictDecoder(Type::type 
type_num,
 }
 
 }  // namespace detail
+
+bool IsParquetVersionAtLeast2_0(ParquetVersion::type parquet_version) {
+  switch (parquet_version) {
+    case ParquetVersion::PARQUET_1_0:
+      return false;
+    case ParquetVersion::PARQUET_2_4:
+    case ParquetVersion::PARQUET_2_6:
+      return true;
+    default:
+      return parquet_version > ParquetVersion::PARQUET_1_0;
+  }
+  return false;
+}
+
+// Informed heavily by 
https://github.com/apache/parquet-format/blob/master/Encodings.md
+// and
+// 
https://github.com/apache/arrow-rs/blob/a15adf66a8212fdb3a0222a32130ce70cfc28cf5/parquet/src/column/writer/mod.rs#L1183
+Encoding::type ChooseNonDictEncoding(Type::type data_type,

Review Comment:
   We may also need input logical type here because FIXED_LEN_BYTE_ARRAY with 
logical type FLOAT16|DECIMAL should support BYTE_STREAM_SPLIT in PARQUET_2_11.
   
   cc @pitrou 



##########
cpp/src/parquet/encoding.cc:
##########
@@ -4055,4 +4055,57 @@ std::unique_ptr<Decoder> MakeDictDecoder(Type::type 
type_num,
 }
 
 }  // namespace detail
+
+bool IsParquetVersionAtLeast2_0(ParquetVersion::type parquet_version) {
+  switch (parquet_version) {
+    case ParquetVersion::PARQUET_1_0:
+      return false;
+    case ParquetVersion::PARQUET_2_4:
+    case ParquetVersion::PARQUET_2_6:
+      return true;
+    default:
+      return parquet_version > ParquetVersion::PARQUET_1_0;
+  }
+  return false;
+}
+
+// Informed heavily by 
https://github.com/apache/parquet-format/blob/master/Encodings.md
+// and
+// 
https://github.com/apache/arrow-rs/blob/a15adf66a8212fdb3a0222a32130ce70cfc28cf5/parquet/src/column/writer/mod.rs#L1183

Review Comment:
   BTW, it seems intuitive to fall back to BYTE_STREAM_SPLIT for FLOAT/DOUBLE. 
arrow-rs does not implement BYTE_STREAM_SPLIT so it does not appear there 
(CMIW).



##########
cpp/src/parquet/encoding.h:
##########
@@ -466,4 +467,9 @@ std::unique_ptr<typename EncodingTraits<DType>::Decoder> 
MakeTypedDecoder(
   return std::unique_ptr<OutType>(dynamic_cast<OutType*>(base.release()));
 }
 
+PARQUET_EXPORT

Review Comment:
   If the purpose of adding it to header file is for testing, I'd suggest 
adding a `encoding_internal.h` instead. Then we don't need to include 
parquet/properties.h as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-40636: [C++][Parquet] Improve fallback encoding choice in column writer. [arrow]

Reply via email to