pitrou commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1242346200
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -2093,15 +2093,44 @@ class FLBARecordReader : public TypedRecordReader<FLBAType>,
std::unique_ptr<::arrow::FixedSizeBinaryBuilder> builder_;
};
-class ByteArrayChunkedRecordReader : public TypedRecordReader<ByteArrayType>,
- virtual public BinaryRecordReader {
+// TODO: Below concept could be used to simplify type assertion in C++20.
+// template <typename T>
+// concept ByteArrayTypeConcept = std::is_same<T, ByteArrayType>::value ||
+// std::is_same<T, LargeByteArrayType>::value;
+
+template <typename T>
+struct IsByteArrayType : std::false_type {};
+
+template <>
+struct IsByteArrayType<ByteArrayType> : std::true_type {};
+
+template <>
+struct IsByteArrayType<LargeByteArrayType> : std::true_type {};
Review Comment:
This is quite verbose. You can define a simple `constexpr` function instead:
```c++
template <typename T>
constexpr bool IsByteArrayType() {
  return std::is_same_v<T, ByteArrayType> || std::is_same_v<T, LargeByteArrayType>;
}
```
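For illustration, a call site could then dispatch at compile time with `if constexpr` (a minimal sketch assuming C++17; `DecodeValues` is a hypothetical function, not code from this PR):
```c++
template <typename DType>
void DecodeValues() {
  if constexpr (IsByteArrayType<DType>()) {
    // ... path shared by ByteArrayType and LargeByteArrayType ...
  } else {
    // ... path for the other physical types ...
  }
}
```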
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -1342,6 +1353,23 @@ TEST_F(TestUInt32ParquetIO, Parquet_1_0_Compatibility) {
using TestStringParquetIO = TestParquetIO<::arrow::StringType>;
+#if defined(_WIN64) || defined(__LP64__)
Review Comment:
I don't understand this condition. Which platforms is it excluding and why?
Large binary data is supposed to work on every platform, so there should be
no reason to skip some platforms here.
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -3862,6 +3905,19 @@ TEST(TestArrowReaderAdHoc, CorruptedSchema) {
TryReadDataFile(path, ::arrow::StatusCode::IOError);
}
+#if defined(ARROW_WITH_BROTLI) && defined(__LP64__)
Review Comment:
I still don't understand what `__LP64__` is doing here. If you really want
to single out 64-bit platforms, you could instead do something like:
```c++
if (sizeof(void*) < 8) {
  GTEST_SKIP() << "Test only runs on 64-bit platforms as it allocates more than 2GB RAM";
}
```
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -1369,6 +1397,7 @@ TEST_F(TestStringParquetIO, EmptyStringColumnRequiredWrite) {
using TestLargeBinaryParquetIO = TestParquetIO<::arrow::LargeBinaryType>;
+#if defined(_WIN64) || defined(__LP64__)
Review Comment:
Again, it does not seem right that you are restricting tests that used to
work on every platform (and that have no obvious reason to fail on some
platforms).
##########
cpp/src/parquet/column_reader.h:
##########
@@ -321,7 +321,8 @@ class PARQUET_EXPORT RecordReader {
static std::shared_ptr<RecordReader> Make(
const ColumnDescriptor* descr, LevelInfo leaf_info,
::arrow::MemoryPool* pool = ::arrow::default_memory_pool(),
- bool read_dictionary = false, bool read_dense_for_nullable = false);
+ bool read_dictionary = false, bool read_dense_for_nullable = false,
+ bool use_large_binary_variants = false);
Review Comment:
Since you added an argument, you have to document it in the docstring above.
Besides, it makes me uneasy that an Arrow-specific argument is making its
way here.
In any case, `use_large_binary_variants` is not easy to understand for the
user, as Parquet doesn't have any "large" binary types. It could have "arrow"
in the name, or stress that this is a purely internal flag (are users expected
to pass it?).
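For example, the docstring could gain a line along these lines (a sketch of possible wording, not final text):
```c++
/// \param use_large_binary_variants if true, accumulate BYTE_ARRAY values into
///   Arrow LargeBinary/LargeString builders (64-bit offsets) instead of the
///   default 32-bit-offset variants. Internal flag used by the Arrow reader
///   layer; third-party callers should not need to pass it.
```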
##########
cpp/src/parquet/encoding.cc:
##########
@@ -1238,19 +1238,33 @@ int PlainBooleanDecoder::Decode(bool* buffer, int max_values) {
return max_values;
}
-struct ArrowBinaryHelper {
- explicit ArrowBinaryHelper(typename EncodingTraits<ByteArrayType>::Accumulator* out) {
+template <typename DType>
+struct ArrowBinaryHelperTraits;
+
+template <>
+struct ArrowBinaryHelperTraits<ByteArrayType> {
+ static constexpr auto memory_limit = ::arrow::kBinaryMemoryLimit;
+};
+
+template <>
+struct ArrowBinaryHelperTraits<LargeByteArrayType> {
+ static constexpr auto memory_limit = ::arrow::kLargeBinaryMemoryLimit;
+};
+
+template <typename BAT>
+struct ArrowBinaryHelperBase {
+ explicit ArrowBinaryHelperBase(typename EncodingTraits<BAT>::Accumulator* out) {
Review Comment:
FTR, this will probably conflict with the changes in the DeltaByteArray
encoder PR, which is far more desirable than this PR. So I would suggest
postponing this PR even if we deem it ready. @wgtmac
##########
cpp/src/parquet/encoding.h:
##########
@@ -437,14 +456,16 @@ std::unique_ptr<typename EncodingTraits<DType>::Encoder> MakeTypedEncoder(
PARQUET_EXPORT
std::unique_ptr<Decoder> MakeDecoder(
Type::type type_num, Encoding::type encoding, const ColumnDescriptor* descr = NULLPTR,
- ::arrow::MemoryPool* pool = ::arrow::default_memory_pool());
+ ::arrow::MemoryPool* pool = ::arrow::default_memory_pool(),
+ bool use_large_binary_variants = false);
Review Comment:
Here again I think the naming should stress the peculiarities of this new
option. I'm not sure we want users to start relying on this in third-party
code, do we?
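One option (a sketch only; the function name and namespace are hypothetical, not existing API) would be to leave the public `MakeDecoder` signature untouched and route the flag through an internal entry point:
```c++
namespace parquet::internal {

// Not part of the public API: only the Arrow read path should call this.
PARQUET_EXPORT
std::unique_ptr<Decoder> MakeArrowLargeVariantDecoder(
    Type::type type_num, Encoding::type encoding,
    const ColumnDescriptor* descr = NULLPTR,
    ::arrow::MemoryPool* pool = ::arrow::default_memory_pool());

}  // namespace parquet::internal
```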
##########
cpp/src/parquet/encoding.h:
##########
@@ -140,17 +142,34 @@ template <>
struct EncodingTraits<ByteArrayType> {
using Encoder = ByteArrayEncoder;
using Decoder = ByteArrayDecoder;
+ using BinaryBuilder = ::arrow::BinaryBuilder;
/// \brief Internal helper class for decoding BYTE_ARRAY data where we can
/// overflow the capacity of a single arrow::BinaryArray
struct Accumulator {
- std::unique_ptr<::arrow::BinaryBuilder> builder;
+ std::unique_ptr<BinaryBuilder> builder;
std::vector<std::shared_ptr<::arrow::Array>> chunks;
};
using ArrowType = ::arrow::BinaryType;
using DictAccumulator = ::arrow::Dictionary32Builder<::arrow::BinaryType>;
};
+template <>
+struct EncodingTraits<LargeByteArrayType> {
+ using Encoder = LargeByteArrayEncoder;
+ using Decoder = LargeByteArrayDecoder;
+ using BinaryBuilder = ::arrow::LargeBinaryBuilder;
+
+ /// \brief Internal helper class for decoding BYTE_ARRAY data where we can
+ /// overflow the capacity of a single arrow::BinaryArray
Review Comment:
Isn't this entirely obsolete? This PR should actually enable us to write
`using Accumulator = ::arrow::LargeBinaryBuilder` here?
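Concretely, the specialization might then shrink to something like this (a sketch of the suggestion above; it assumes a single `LargeBinaryBuilder` can hold any column chunk because its offsets are int64):
```c++
template <>
struct EncodingTraits<LargeByteArrayType> {
  using Encoder = LargeByteArrayEncoder;
  using Decoder = LargeByteArrayDecoder;
  // No chunked accumulator needed: int64 offsets cannot realistically overflow.
  using Accumulator = ::arrow::LargeBinaryBuilder;
  using ArrowType = ::arrow::LargeBinaryType;
  using DictAccumulator = ::arrow::Dictionary32Builder<::arrow::LargeBinaryType>;
};
```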
##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -487,8 +487,9 @@ Status TransferBinary(RecordReader* reader, MemoryPool* pool,
auto chunks = binary_reader->GetBuilderChunks();
for (auto& chunk : chunks) {
if (!chunk->type()->Equals(*logical_type_field->type())) {
- // XXX: if a LargeBinary chunk is larger than 2GB, the MSBs of offsets
- // will be lost because they are first created as int32 and then cast to int64.
+ // If a LargeBinary chunk is larger than 2GB and use_large_binary_variants
Review Comment:
I would keep the XXX because it is a gotcha.
##########
cpp/src/parquet/types.h:
##########
@@ -761,6 +761,18 @@ using Int96Type = PhysicalType<Type::INT96>;
using FloatType = PhysicalType<Type::FLOAT>;
using DoubleType = PhysicalType<Type::DOUBLE>;
using ByteArrayType = PhysicalType<Type::BYTE_ARRAY>;
+
+/*
+ * Parquet has defined ByteArrayType for variable length string and binary values with a
+ * maximum length of 2^31 - 1. By default, arrow StringType and BinaryType are used to
+ * map parquet ByteArrayType. However, arrow StringArray/BinaryArray uses int32_t to
+ * store the offset of each string/binary value in a concatenated buffer which may
+ * overflow (though unlikely in most cases). As arrow has defined LargeStringType and
+ * LargeBinaryType which use int64_t as the offset type, we define LargeByteArrayType
+ * below to indicate parquet reader/writer to use those large variants from arrow.
+ * */
+struct LargeByteArrayType : public ByteArrayType {};
Review Comment:
Again it makes me uneasy that we are adding a fake Parquet type at the
toplevel just to deal with some details of Arrow reading.
Minimally I would request that this is clearly flagged as internal and not a
real type:
```c++
namespace internal {
// DO NOT USE THIS in third-party code.
//
// This is a type marker passed internally when reading Parquet data
// to Arrow, if the user requested to read binary data as large binary
// (with `ArrowReaderProperties::use_large_binary_variants`).
//
// Also, this might be removed if we find a better way of implementing
// that feature.
struct FakeLargeByteArrayType : public ByteArrayType {};
}
```
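For illustration, the reader layer would then select the marker type roughly like this (a sketch only; the factory name and the templated record-reader signature are assumptions, not the PR's actual code):
```c++
std::shared_ptr<RecordReader> MakeByteArrayReader(const ColumnDescriptor* descr,
                                                  LevelInfo leaf_info,
                                                  ::arrow::MemoryPool* pool,
                                                  bool use_large_binary_variants) {
  if (use_large_binary_variants) {
    // The marker type only selects the EncodingTraits specialization;
    // the on-disk physical type is still BYTE_ARRAY.
    return std::make_shared<ByteArrayChunkedRecordReader<internal::FakeLargeByteArrayType>>(
        descr, leaf_info, pool);
  }
  return std::make_shared<ByteArrayChunkedRecordReader<ByteArrayType>>(descr, leaf_info, pool);
}
```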
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -4605,6 +4667,27 @@ TEST_F(TestArrowReadDeltaEncoding, DeltaByteArray) {
::arrow::AssertTablesEqual(*actual_table, *expect_table, false);
}
+TEST_F(TestArrowReadDeltaEncoding, DeltaByteArrayWithLargeBinaryVariant) {
+ std::shared_ptr<::arrow::Table> actual_table, expect_table;
+ ArrowReaderProperties properties;
+ properties.set_use_large_binary_variants(true);
+
+ ReadTableFromParquetFile("delta_byte_array.parquet", properties, &actual_table);
+
+ auto convert_options = ::arrow::csv::ConvertOptions::Defaults();
+ std::vector<std::string> column_names = {
+ "c_customer_id", "c_salutation", "c_first_name",
+ "c_last_name", "c_preferred_cust_flag", "c_birth_country",
+ "c_login", "c_email_address", "c_last_review_date"};
+ for (auto name : column_names) {
+ convert_options.column_types[name] = ::arrow::large_utf8();
+ }
+ convert_options.strings_can_be_null = true;
+ ReadTableFromCSVFile("delta_byte_array_expect.csv", convert_options, &expect_table);
Review Comment:
Looks like you could factor this out in the test fixture.
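For instance, the fixture could grow a helper along these lines (a sketch; `ReadExpectedCsv` is a hypothetical name, and `ReadTableFromCSVFile` is assumed to be the utility already used by these tests):
```c++
// Shared by DeltaByteArray and the large-variant test, parameterized on the
// Arrow string type expected for all columns.
void ReadExpectedCsv(const std::shared_ptr<::arrow::DataType>& string_type,
                     std::shared_ptr<::arrow::Table>* expect_table) {
  auto convert_options = ::arrow::csv::ConvertOptions::Defaults();
  const std::vector<std::string> column_names = {
      "c_customer_id", "c_salutation",          "c_first_name",
      "c_last_name",   "c_preferred_cust_flag", "c_birth_country",
      "c_login",       "c_email_address",       "c_last_review_date"};
  for (const auto& name : column_names) {
    convert_options.column_types[name] = string_type;
  }
  convert_options.strings_can_be_null = true;
  ReadTableFromCSVFile("delta_byte_array_expect.csv", convert_options, expect_table);
}
```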
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -3862,6 +3905,19 @@ TEST(TestArrowReaderAdHoc, CorruptedSchema) {
TryReadDataFile(path, ::arrow::StatusCode::IOError);
}
+#if defined(ARROW_WITH_BROTLI) && defined(__LP64__)
Review Comment:
I also see that this test takes 18 seconds in debug mode. This seems a bit
excessive :-/
##########
cpp/src/parquet/encoding.h:
##########
@@ -140,15 +143,37 @@ template <>
struct EncodingTraits<ByteArrayType> {
using Encoder = ByteArrayEncoder;
using Decoder = ByteArrayDecoder;
+ using BinaryBuilder = ::arrow::BinaryBuilder;
Review Comment:
Ping @arthurpassos on this.