blackmwk commented on code in PR #2181:
URL: https://github.com/apache/iceberg-rust/pull/2181#discussion_r2886954327
##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -510,9 +554,11 @@ impl ArrowReader {
)
.with_preload_column_index(true)
.with_preload_offset_index(true)
- .with_preload_page_index(should_load_page_index);
+ .with_preload_page_index(should_load_page_index)
+ .with_range_coalesce_bytes(parquet_read_options.range_coalesce_bytes)
Review Comment:
Also include the previous three arguments in `ParquetReadOptions`?
##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -60,14 +60,42 @@ use crate::spec::{Datum, NameMapping, NestedField,
PrimitiveType, Schema, Type};
use crate::utils::available_parallelism;
use crate::{Error, ErrorKind};
+/// Default gap between byte ranges below which they are coalesced into a
+/// single request. Matches object_store's `OBJECT_STORE_COALESCE_DEFAULT`.
+const DEFAULT_RANGE_COALESCE_BYTES: u64 = 1024 * 1024;
+
+/// Default maximum number of coalesced byte ranges fetched concurrently.
+/// Matches object_store's `OBJECT_STORE_COALESCE_PARALLEL`.
+const DEFAULT_RANGE_FETCH_CONCURRENCY: usize = 10;
+
+/// Options for tuning Parquet file I/O.
+#[derive(Clone, Copy, Debug)]
+pub(crate) struct ParquetReadOptions {
+ pub metadata_size_hint: Option<usize>,
Review Comment:
Please don't use pub field, for immutable objects like this, please use
`TypedBuilder` and getter function.
##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -1710,18 +1756,23 @@ pub struct ArrowFileReader {
preload_offset_index: bool,
preload_page_index: bool,
metadata_size_hint: Option<usize>,
+ range_coalesce_bytes: u64,
Review Comment:
nit: We should directly pass the `ParquetReadOption`?
##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -1763,6 +1826,49 @@ impl AsyncFileReader for ArrowFileReader {
)
}
+ /// Override the default `get_byte_ranges` which calls `get_bytes`
sequentially.
+ /// The parquet reader calls this to fetch column chunks for a row group,
so
+ /// without this override each column chunk is a serial round-trip to
object storage.
+ /// Adapted from object_store's `coalesce_ranges` in `util.rs`.
+ fn get_byte_ranges(
Review Comment:
Could we add some ut for this function? It should be relatively easy to mock
the inner `FileRead`, I'm a little worring about the correctness of the logic
here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]