wgtmac commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1223737513
##########
cpp/src/parquet/column_reader.h:
##########
@@ -321,7 +321,8 @@ class PARQUET_EXPORT RecordReader {
   static std::shared_ptr<RecordReader> Make(
       const ColumnDescriptor* descr, LevelInfo leaf_info,
       ::arrow::MemoryPool* pool = ::arrow::default_memory_pool(),
-      bool read_dictionary = false, bool read_dense_for_nullable = false);
+      bool read_dictionary = false, bool read_dense_for_nullable = false,
+      bool use_binary_string_large_variants = false);

Review Comment:
```suggestion
      bool use_large_binary_variants = false);
```

##########
cpp/src/parquet/column_reader.cc:
##########
@@ -2241,12 +2289,25 @@ std::shared_ptr<RecordReader> MakeByteArrayRecordReader(const ColumnDescriptor*
   }
 }
 
+std::shared_ptr<RecordReader> MakeLargeByteArrayRecordReader(
+    const ColumnDescriptor* descr, LevelInfo leaf_info, ::arrow::MemoryPool* pool,
+    bool read_dictionary, bool read_dense_for_nullable) {
+  if (read_dictionary) {
+    return std::make_shared<LargeByteArrayDictionaryRecordReader>(
+        descr, leaf_info, pool, read_dense_for_nullable);
+  } else {
+    return std::make_shared<LargeByteArrayChunkedRecordReader>(descr, leaf_info, pool,
+                                                               read_dense_for_nullable);
+  }
+}
+
 }  // namespace
 
 std::shared_ptr<RecordReader> RecordReader::Make(const ColumnDescriptor* descr,
                                                  LevelInfo leaf_info, MemoryPool* pool,
                                                  bool read_dictionary,
-                                                 bool read_dense_for_nullable) {
+                                                 bool read_dense_for_nullable,
+                                                 bool use_binary_string_large_variants) {

Review Comment:
```suggestion
                                                 bool use_large_binary_variants) {
```

##########
cpp/src/parquet/page_index.cc:
##########
@@ -853,6 +853,7 @@ std::unique_ptr<ColumnIndex> ColumnIndex::Make(const ColumnDescriptor& descr,
       return std::make_unique<TypedColumnIndexImpl<DoubleType>>(descr, column_index);
     case Type::BYTE_ARRAY:
       return std::make_unique<TypedColumnIndexImpl<ByteArrayType>>(descr, column_index);
+      // TODO AP FIX ARTHUR PASSOS

Review Comment:
Please remove these TODOs.

##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -3862,6 +3868,18 @@ TEST(TestArrowReaderAdHoc, CorruptedSchema) {
   TryReadDataFile(path, ::arrow::StatusCode::IOError);
 }
 
+TEST(TestArrowParquet, LargeByteArray) {

Review Comment:
I mean the former one. The key point is to make sure `LargeStringArray` works perfectly with a non-overflow file.

##########
cpp/src/parquet/encoding.h:
##########
@@ -140,15 +143,37 @@ template <>
 struct EncodingTraits<ByteArrayType> {
   using Encoder = ByteArrayEncoder;
   using Decoder = ByteArrayDecoder;
+  using BinaryBuilder = ::arrow::BinaryBuilder;

Review Comment:
You can simply get the builder type via `::arrow::TypeTraits<ArrowType>::BuilderType`. Does this help to remove the `BinaryBuilder` alias? My point is to avoid adding to `EncodingTraits` any item that is not generic.
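To make that last suggestion concrete, here is a minimal sketch (not the PR's actual code; `AppendValue` is a hypothetical helper) of how `::arrow::TypeTraits` already maps each Arrow type to its builder, so templated decoder code can recover `BinaryBuilder` or `LargeBinaryBuilder` without a bespoke `EncodingTraits` member:

```cpp
#include <string_view>
#include <type_traits>

#include "arrow/builder.h"
#include "arrow/status.h"
#include "arrow/type_traits.h"

// arrow::TypeTraits bundles the companion classes of each Arrow type; its
// BuilderType member already distinguishes the 32-bit and 64-bit offset
// variants, so EncodingTraits does not need its own builder alias.
static_assert(std::is_same_v<::arrow::TypeTraits<::arrow::BinaryType>::BuilderType,
                             ::arrow::BinaryBuilder>);
static_assert(std::is_same_v<::arrow::TypeTraits<::arrow::LargeBinaryType>::BuilderType,
                             ::arrow::LargeBinaryBuilder>);

// Hypothetical helper templated on the Arrow type rather than on a builder
// alias. BinaryBuilder and LargeBinaryBuilder expose the same Append()
// surface, so one definition serves both; only the offset width
// (int32_t vs int64_t) differs under the hood.
template <typename ArrowType>
::arrow::Status AppendValue(
    typename ::arrow::TypeTraits<ArrowType>::BuilderType* builder,
    std::string_view value) {
  return builder->Append(value);
}
```

A caller would instantiate it explicitly, e.g. `AppendValue<::arrow::LargeBinaryType>(&large_builder, value)`, since the builder argument sits in a non-deduced context and cannot deduce `ArrowType` on its own.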