pitrou commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1242346200
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -2093,15 +2093,44 @@ class FLBARecordReader : public TypedRecordReader<FLBAType>,
std::unique_ptr<::arrow::FixedSizeBinaryBuilder> builder_;
};
-class ByteArrayChunkedRecordReader : public TypedRecordReader<ByteArrayType>,
- virtual public BinaryRecordReader {
+// TODO: Below concept could be used to simplify type assertion in C++20.
+// template <typename T>
+// concept ByteArrayTypeConcept = std::is_same<T, ByteArrayType>::value ||
+// std::is_same<T, LargeByteArrayType>::value;
+
+template <typename T>
+struct IsByteArrayType : std::false_type {};
+
+template <>
+struct IsByteArrayType<ByteArrayType> : std::true_type {};
+
+template <>
+struct IsByteArrayType<LargeByteArrayType> : std::true_type {};
Review Comment:
This is quite verbose. You can define a simple `constexpr` function instead:
```c++
template <typename T>
constexpr bool IsByteArrayType() {
  return std::is_same_v<T, ByteArrayType> || std::is_same_v<T, LargeByteArrayType>;
}
```
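For illustration, a call site could then dispatch at compile time with `if constexpr` (a minimal sketch assuming C++17; `DecodeValues` is a hypothetical function, not code from this PR):
```c++
template <typename DType>
void DecodeValues() {
  if constexpr (IsByteArrayType<DType>()) {
    // ... path shared by ByteArrayType and LargeByteArrayType ...
  } else {
    // ... path for the other physical types ...
  }
}
```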
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -1342,6 +1353,23 @@ TEST_F(TestUInt32ParquetIO, Parquet_1_0_Compatibility) {
using TestStringParquetIO = TestParquetIO<::arrow::StringType>;
+#if defined(_WIN64) || defined(__LP64__)
Review Comment:
I don't understand this condition. Which platforms is it excluding and why?
Large binary data is supposed to work on every platform, so there should be
no reason to skip some platforms here.
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -3862,6 +3905,19 @@ TEST(TestArrowReaderAdHoc, CorruptedSchema) {
TryReadDataFile(path, ::arrow::StatusCode::IOError);
}
+#if defined(ARROW_WITH_BROTLI) && defined(__LP64__)
Review Comment:
I still don't understand what `__LP64__` is doing here. If you really want
to single out 64-bit platforms, you could instead do something like:
```c++
if (sizeof(void*) < 8) {
  GTEST_SKIP() << "Test only runs on 64-bit platforms as it allocates more than 2GB RAM";
}
```
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -1369,6 +1397,7 @@ TEST_F(TestStringParquetIO, EmptyStringColumnRequiredWrite) {
using TestLargeBinaryParquetIO = TestParquetIO<::arrow::LargeBinaryType>;
+#if defined(_WIN64) || defined(__LP64__)
Review Comment:
Again, it does not seem right that you are restricting tests that used to
work on every platform (and that have no obvious reason to fail on some
platforms).
##########
cpp/src/parquet/column_reader.h:
##########
@@ -321,7 +321,8 @@ class PARQUET_EXPORT RecordReader {
static std::shared_ptr<RecordReader> Make(
const ColumnDescriptor* descr, LevelInfo leaf_info,
::arrow::MemoryPool* pool = ::arrow::default_memory_pool(),
- bool read_dictionary = false, bool read_dense_for_nullable = false);
+ bool read_dictionary = false, bool read_dense_for_nullable = false,
+ bool use_large_binary_variants = false);
Review Comment:
Since you added an argument, you have to document it in the docstring above.
Besides, it makes me uneasy that an Arrow-specific argument is making its
way here.
In any case, `use_large_binary_variants` is not easy to understand for the
user, as Parquet doesn't have any "large" binary types. It could have "arrow"
in the name, or stress that this is a purely internal flag (are users expected
to pass it?).
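For example, the docstring could gain a line along these lines (a sketch of possible wording, not final text):
```c++
/// \param use_large_binary_variants if true, accumulate BYTE_ARRAY values into
///   Arrow LargeBinary/LargeString builders (64-bit offsets) instead of the
///   default 32-bit-offset variants. Internal flag used by the Arrow reader
///   layer; third-party callers should not need to pass it.
```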
##########
cpp/src/parquet/encoding.cc:
##########
@@ -1238,19 +1238,33 @@ int PlainBooleanDecoder::Decode(bool* buffer, int max_values) {
return max_values;
}
-struct ArrowBinaryHelper {
- explicit ArrowBinaryHelper(typename EncodingTraits<ByteArrayType>::Accumulator* out) {
+template <typename DType>
+struct ArrowBinaryHelperTraits;
+
+template <>
+struct ArrowBinaryHelperTraits<ByteArrayType> {
+ static constexpr auto memory_limit = ::arrow::kBinaryMemoryLimit;
+};
+
+template <>
+struct ArrowBinaryHelperTraits<LargeByteArrayType> {
+ static constexpr auto memory_limit = ::arrow::kLargeBinaryMemoryLimit;
+};
+
+template <typename BAT>
+struct ArrowBinaryHelperBase {
+ explicit ArrowBinaryHelperBase(typename EncodingTraits<BAT>::Accumulator* out) {
Review Comment:
FTR, this will probably conflict with the changes in the DeltaByteArray
encoder PR, which is far more desirable than this PR. So I would suggest
postponing this PR even if we deem it ready. @wgtmac
##########
cpp/src/parquet/encoding.h:
##########
@@ -437,14 +456,16 @@ std::unique_ptr<typename EncodingTraits<DType>::Encoder> MakeTypedEncoder(
PARQUET_EXPORT
std::unique_ptr<Decoder> MakeDecoder(
Type::type type_num, Encoding::type encoding, const ColumnDescriptor* descr = NULLPTR,
- ::arrow::MemoryPool* pool = ::arrow::default_memory_pool());
+ ::arrow::MemoryPool* pool = ::arrow::default_memory_pool(),
+ bool use_large_binary_variants = false);
Review Comment:
Here again I think the naming should stress the peculiarities of this new
option. I'm not sure we want users to start relying on this in third-party
code, do we?
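One option (a sketch only; the function name and namespace are hypothetical, not existing API) would be to leave the public `MakeDecoder` signature untouched and route the flag through an internal entry point:
```c++
namespace parquet::internal {

// Not part of the public API: only the Arrow read path should call this.
PARQUET_EXPORT
std::unique_ptr<Decoder> MakeArrowLargeVariantDecoder(
    Type::type type_num, Encoding::type encoding,
    const ColumnDescriptor* descr = NULLPTR,
    ::arrow::MemoryPool* pool = ::arrow::default_memory_pool());

}  // namespace parquet::internal
```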
##########
cpp/src/parquet/encoding.h:
##########
@@ -140,17 +142,34 @@ template <>
struct EncodingTraits<ByteArrayType> {
using Encoder = ByteArrayEncoder;
using Decoder = ByteArrayDecoder;
+ using BinaryBuilder = ::arrow::BinaryBuilder;
/// \brief Internal helper class for decoding BYTE_ARRAY data where we can
/// overflow the capacity of a single arrow::BinaryArray
struct Accumulator {
- std::unique_ptr<::arrow::BinaryBuilder> builder;
+ std::unique_ptr<BinaryBuilder> builder;
std::vector<std::shared_ptr<::arrow::Array>> chunks;
};
using ArrowType = ::arrow::BinaryType;
using DictAccumulator = ::arrow::Dictionary32Builder<::arrow::BinaryType>;
};
+template <>
+struct EncodingTraits<LargeByteArrayType> {
+ using Encoder = LargeByteArrayEncoder;
+ using Decoder = LargeByteArrayDecoder;
+ using BinaryBuilder = ::arrow::LargeBinaryBuilder;
+
+ /// \brief Internal helper class for decoding BYTE_ARRAY data where we can
+ /// overflow the capacity of a single arrow::BinaryArray
Review Comment:
Isn't this entirely obsolete? This PR should actually enable us to write
`using Accumulator = ::arrow::LargeBinaryBuilder` here?
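Concretely, the specialization might then shrink to something like this (a sketch of the suggestion above; it assumes a single `LargeBinaryBuilder` can hold any column chunk because its offsets are int64):
```c++
template <>
struct EncodingTraits<LargeByteArrayType> {
  using Encoder = LargeByteArrayEncoder;
  using Decoder = LargeByteArrayDecoder;
  // No chunked accumulator needed: int64 offsets cannot realistically overflow.
  using Accumulator = ::arrow::LargeBinaryBuilder;
  using ArrowType = ::arrow::LargeBinaryType;
  using DictAccumulator = ::arrow::Dictionary32Builder<::arrow::LargeBinaryType>;
};
```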
##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -487,8 +487,9 @@ Status TransferBinary(RecordReader* reader, MemoryPool* pool,
auto chunks = binary_reader->GetBuilderChunks();
for (auto& chunk : chunks) {
if (!chunk->type()->Equals(*logical_type_field->type())) {
- // XXX: if a LargeBinary chunk is larger than 2GB, the MSBs of offsets
- // will be lost because they are first created as int32 and then cast to int64.
+ // If a LargeBinary chunk is larger than 2GB and use_large_binary_variants
Review Comment:
I would keep the XXX because it is a gotcha.
##########
cpp/src/parquet/types.h:
##########
@@ -761,6 +761,18 @@ using Int96Type = PhysicalType<Type::INT96>;
using FloatType = PhysicalType<Type::FLOAT>;
using DoubleType = PhysicalType<Type::DOUBLE>;
using ByteArrayType = PhysicalType<Type::BYTE_ARRAY>;
+
+/*
+ * Parquet has defined ByteArrayType for variable length string and binary values with a
+ * maximum length of 2^31 - 1. By default, arrow StringType and BinaryType are used to
+ * map parquet ByteArrayType. However, arrow StringArray/BinaryArray uses int32_t to
+ * store the offset of each string/binary value in a concatenated buffer which may
+ * overflow (though unlikely in most cases). As arrow has defined LargeStringType and
+ * LargeBinaryType which use int64_t as the offset type, we define LargeByteArrayType
+ * below to indicate parquet reader/writer to use those large variants from arrow.
+ * */
+struct LargeByteArrayType : public ByteArrayType {};
Review Comment:
Again it makes me uneasy that we are adding a fake Parquet type at the
toplevel just to deal with some details of Arrow reading.
Minimally I would request that this is clearly flagged as internal and not a
real type:
```c++
namespace internal {
// DO NOT USE THIS in third-party code.
//
// This is a type marker passed internally when reading Parquet data
// to Arrow, if the user requested to read binary data as large binary
// (with `ArrowReaderProperties::use_large_binary_variants`).
//
// Also, this might be removed if we find a better way of implementing
// that feature.
struct FakeLargeByteArrayType : public ByteArrayType {};
}
```
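For illustration, the reader layer would then select the marker type roughly like this (a sketch only; the factory name and the templated record-reader signature are assumptions, not the PR's actual code):
```c++
std::shared_ptr<RecordReader> MakeByteArrayReader(const ColumnDescriptor* descr,
                                                  LevelInfo leaf_info,
                                                  ::arrow::MemoryPool* pool,
                                                  bool use_large_binary_variants) {
  if (use_large_binary_variants) {
    // The marker type only selects the EncodingTraits specialization;
    // the on-disk physical type is still BYTE_ARRAY.
    return std::make_shared<ByteArrayChunkedRecordReader<internal::FakeLargeByteArrayType>>(
        descr, leaf_info, pool);
  }
  return std::make_shared<ByteArrayChunkedRecordReader<ByteArrayType>>(descr, leaf_info, pool);
}
```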
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -4605,6 +4667,27 @@ TEST_F(TestArrowReadDeltaEncoding, DeltaByteArray) {
::arrow::AssertTablesEqual(*actual_table, *expect_table, false);
}
+TEST_F(TestArrowReadDeltaEncoding, DeltaByteArrayWithLargeBinaryVariant) {
+ std::shared_ptr<::arrow::Table> actual_table, expect_table;
+ ArrowReaderProperties properties;
+ properties.set_use_large_binary_variants(true);
+
+ ReadTableFromParquetFile("delta_byte_array.parquet", properties, &actual_table);
+
+ auto convert_options = ::arrow::csv::ConvertOptions::Defaults();
+ std::vector<std::string> column_names = {
+ "c_customer_id", "c_salutation", "c_first_name",
+ "c_last_name", "c_preferred_cust_flag", "c_birth_country",
+ "c_login", "c_email_address", "c_last_review_date"};
+ for (auto name : column_names) {
+ convert_options.column_types[name] = ::arrow::large_utf8();
+ }
+ convert_options.strings_can_be_null = true;
+ ReadTableFromCSVFile("delta_byte_array_expect.csv", convert_options, &expect_table);
Review Comment:
Looks like you could factor this out in the test fixture.
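For instance, the fixture could grow a helper along these lines (a sketch; `ReadExpectedCsv` is a hypothetical name, and `ReadTableFromCSVFile` is assumed to be the utility already used by these tests):
```c++
// Shared by DeltaByteArray and the large-variant test, parameterized on the
// Arrow string type expected for all columns.
void ReadExpectedCsv(const std::shared_ptr<::arrow::DataType>& string_type,
                     std::shared_ptr<::arrow::Table>* expect_table) {
  auto convert_options = ::arrow::csv::ConvertOptions::Defaults();
  const std::vector<std::string> column_names = {
      "c_customer_id", "c_salutation",          "c_first_name",
      "c_last_name",   "c_preferred_cust_flag", "c_birth_country",
      "c_login",       "c_email_address",       "c_last_review_date"};
  for (const auto& name : column_names) {
    convert_options.column_types[name] = string_type;
  }
  convert_options.strings_can_be_null = true;
  ReadTableFromCSVFile("delta_byte_array_expect.csv", convert_options, expect_table);
}
```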
##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -3862,6 +3905,19 @@ TEST(TestArrowReaderAdHoc, CorruptedSchema) {
TryReadDataFile(path, ::arrow::StatusCode::IOError);
}
+#if defined(ARROW_WITH_BROTLI) && defined(__LP64__)
Review Comment:
I also see that this test takes 18 seconds in debug mode. This seems a bit
excessive :-/
##########
cpp/src/parquet/encoding.h:
##########
@@ -140,15 +143,37 @@ template <>
struct EncodingTraits<ByteArrayType> {
using Encoder = ByteArrayEncoder;
using Decoder = ByteArrayDecoder;
+ using BinaryBuilder = ::arrow::BinaryBuilder;
Review Comment:
Ping @arthurpassos on this.