westonpace commented on code in PR #36779:
URL: https://github.com/apache/arrow/pull/36779#discussion_r1304717495
##########
cpp/src/parquet/arrow/reader.cc:
##########
@@ -114,6 +116,32 @@ class ColumnReaderImpl : public ColumnReader {
return Status::OK();
}
+ ::arrow::Result<std::shared_ptr<::arrow::ChunkedArray>> NextBatch(
+ int64_t batch_size) final {
+ std::shared_ptr<::arrow::ChunkedArray> out;
+ RETURN_NOT_OK(NextBatch(batch_size, &out));
+ return out;
+ }
+
+ Future<std::shared_ptr<ChunkedArray>> NextBatchAsync(
+ int64_t batch_size, ::arrow::internal::Executor* io_executor,
+ ::arrow::internal::Executor* cpu_executor) final {
+ Future<> load_fut = ::arrow::DeferNotOk(
+ io_executor->Submit([this, batch_size] { return LoadBatch(batch_size); }));
+ return load_fut.Then(
+ [this, batch_size, cpu_executor]() -> Future<std::shared_ptr<ChunkedArray>> {
+ return DeferNotOk(cpu_executor->Submit(
+ [this, batch_size]() -> Result<std::shared_ptr<ChunkedArray>> {
+ std::shared_ptr<ChunkedArray> out;
+ RETURN_NOT_OK(BuildArray(batch_size, &out));
+ for (int x = 0; x < out->num_chunks(); x++) {
+ RETURN_NOT_OK(out->chunk(x)->Validate());
Review Comment:
I'm only guessing, since I didn't write the original implementation, but I
suspect validation is almost always required for security reasons: a malicious
user could otherwise craft a Parquet file that triggers a buffer overflow. For
example, they could store a list array in which one of the offsets is far out
of range.
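
To make the out-of-range offset concern concrete, here is a minimal sketch of the kind of bounds check that validation performs on list offsets. It is not Arrow's actual `Validate()` implementation: `SimpleListArray` and `ValidateOffsets` are hypothetical names invented for illustration, and the real checks cover considerably more than this.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a list array (not Arrow's real type):
// offsets[i] and offsets[i + 1] delimit the i-th list inside `values`.
struct SimpleListArray {
  std::vector<int32_t> offsets;
  std::vector<int32_t> values;
};

// Returns an empty string when every offset is safe to dereference,
// otherwise a description of the first problem found.
std::string ValidateOffsets(const SimpleListArray& arr) {
  if (arr.offsets.empty()) {
    return "offsets must contain at least one entry";
  }
  if (arr.offsets.front() < 0) {
    return "first offset is negative";
  }
  // Offsets must never decrease; otherwise a list would have negative length.
  for (std::size_t i = 1; i < arr.offsets.size(); ++i) {
    if (arr.offsets[i] < arr.offsets[i - 1]) {
      return "offset at index " + std::to_string(i) + " decreases";
    }
  }
  // The last offset is the largest one, so checking it against the length of
  // the values buffer bounds every other offset as well.
  if (static_cast<std::size_t>(arr.offsets.back()) > arr.values.size()) {
    return "last offset " + std::to_string(arr.offsets.back()) +
           " exceeds values length " + std::to_string(arr.values.size());
  }
  return "";
}
```

With three values and offsets `{0, 2, 1000}`, this check reports an error instead of letting a later read of the third list run 997 elements past the end of the values buffer, which is the overflow a crafted file would be aiming for.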