wgtmac commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1223737513
##########
cpp/src/parquet/column_reader.h:
##########
@@ -321,7 +321,8 @@ class PARQUET_EXPORT RecordReader {
   static std::shared_ptr<RecordReader> Make(
       const ColumnDescriptor* descr, LevelInfo leaf_info,
       ::arrow::MemoryPool* pool = ::arrow::default_memory_pool(),
-      bool read_dictionary = false, bool read_dense_for_nullable = false);
+      bool read_dictionary = false, bool read_dense_for_nullable = false,
+      bool use_binary_string_large_variants = false);

Review Comment:
```suggestion
      bool use_large_binary_variants = false);
```

##########
cpp/src/parquet/column_reader.cc:
##########
@@ -2241,12 +2289,25 @@ std::shared_ptr<RecordReader> MakeByteArrayRecordReader(const ColumnDescriptor*
   }
 }
 
+std::shared_ptr<RecordReader> MakeLargeByteArrayRecordReader(
+    const ColumnDescriptor* descr, LevelInfo leaf_info, ::arrow::MemoryPool* pool,
+    bool read_dictionary, bool read_dense_for_nullable) {
+  if (read_dictionary) {
+    return std::make_shared<LargeByteArrayDictionaryRecordReader>(
+        descr, leaf_info, pool, read_dense_for_nullable);
+  } else {
+    return std::make_shared<LargeByteArrayChunkedRecordReader>(descr, leaf_info, pool,
+                                                               read_dense_for_nullable);
+  }
+}
+
 }  // namespace
 
 std::shared_ptr<RecordReader> RecordReader::Make(const ColumnDescriptor* descr,
                                                  LevelInfo leaf_info, MemoryPool* pool,
                                                  bool read_dictionary,
-                                                 bool read_dense_for_nullable) {
+                                                 bool read_dense_for_nullable,
+                                                 bool use_binary_string_large_variants) {

Review Comment:
```suggestion
                                                 bool use_large_binary_variants) {
```

##########
cpp/src/parquet/page_index.cc:
##########
@@ -853,6 +853,7 @@ std::unique_ptr<ColumnIndex> ColumnIndex::Make(const ColumnDescriptor& descr,
       return std::make_unique<TypedColumnIndexImpl<DoubleType>>(descr, column_index);
     case Type::BYTE_ARRAY:
       return std::make_unique<TypedColumnIndexImpl<ByteArrayType>>(descr, column_index);
+      // TODO AP FIX ARTHUR PASSOS

Review Comment:
Please remove these TODOs.

##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -3862,6 +3868,18 @@ TEST(TestArrowReaderAdHoc, CorruptedSchema) {
   TryReadDataFile(path, ::arrow::StatusCode::IOError);
 }
 
+TEST(TestArrowParquet, LargeByteArray) {

Review Comment:
I mean the former one. The key point is to make sure `LargeStringArray` works perfectly with a non-overflow file.

##########
cpp/src/parquet/encoding.h:
##########
@@ -140,15 +143,37 @@ template <>
 struct EncodingTraits<ByteArrayType> {
   using Encoder = ByteArrayEncoder;
   using Decoder = ByteArrayDecoder;
+  using BinaryBuilder = ::arrow::BinaryBuilder;

Review Comment:
You can simply get the builder type via `::arrow::TypeTraits<ArrowType>::BuilderType`. Does this help to remove the `BinaryBuilder` alias? My point is to avoid adding to `EncodingTraits` any item that is not generic.
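To make that last suggestion concrete, here is a minimal sketch (not the PR's actual code; `AppendValue` is a hypothetical helper) of how `::arrow::TypeTraits` already maps each Arrow type to its builder, so templated decoder code can recover `BinaryBuilder` or `LargeBinaryBuilder` without a bespoke `EncodingTraits` member:

```cpp
#include <string_view>
#include <type_traits>

#include "arrow/builder.h"
#include "arrow/status.h"
#include "arrow/type_traits.h"

// arrow::TypeTraits bundles the companion classes of each Arrow type; its
// BuilderType member already distinguishes the 32-bit and 64-bit offset
// variants, so EncodingTraits does not need its own builder alias.
static_assert(std::is_same_v<::arrow::TypeTraits<::arrow::BinaryType>::BuilderType,
                             ::arrow::BinaryBuilder>);
static_assert(std::is_same_v<::arrow::TypeTraits<::arrow::LargeBinaryType>::BuilderType,
                             ::arrow::LargeBinaryBuilder>);

// Hypothetical helper templated on the Arrow type rather than on a builder
// alias. BinaryBuilder and LargeBinaryBuilder expose the same Append()
// surface, so one definition serves both; only the offset width
// (int32_t vs int64_t) differs under the hood.
template <typename ArrowType>
::arrow::Status AppendValue(
    typename ::arrow::TypeTraits<ArrowType>::BuilderType* builder,
    std::string_view value) {
  return builder->Append(value);
}
```

A caller would instantiate it explicitly, e.g. `AppendValue<::arrow::LargeBinaryType>(&large_builder, value)`, since the builder argument sits in a non-deduced context and cannot deduce `ArrowType` on its own.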