EmilyMatt commented on code in PR #8930:
URL: https://github.com/apache/arrow-rs/pull/8930#discussion_r2683965610
##########
arrow-avro/src/reader/async_reader/builder.rs:
##########
@@ -0,0 +1,149 @@
+use crate::codec::AvroFieldBuilder;
+use crate::reader::async_reader::ReaderState;
+use crate::reader::header::{Header, HeaderDecoder};
+use crate::reader::record::RecordDecoder;
+use crate::reader::{AsyncAvroReader, AsyncFileReader, Decoder};
+use crate::schema::{AvroSchema, FingerprintAlgorithm};
+use arrow_schema::{ArrowError, SchemaRef};
+use indexmap::IndexMap;
+use std::ops::Range;
+
+/// Builder for an asynchronous Avro file reader.
+pub struct AsyncAvroReaderBuilder<R: AsyncFileReader> {
+ pub(super) reader: R,
+ pub(super) file_size: u64,
+ pub(super) schema: SchemaRef,
+ pub(super) batch_size: usize,
+ pub(super) range: Option<Range<u64>>,
+ pub(super) reader_schema: Option<AvroSchema>,
+}
+
+impl<R: AsyncFileReader> AsyncAvroReaderBuilder<R> {
+ /// Specify a byte range to read from the Avro file.
+ /// If this is provided, the reader will read all the blocks within the specified range,
+ /// if the range ends mid-block, it will attempt to fetch the remaining bytes to complete the block,
+ /// but no further blocks will be read.
+ /// If this is omitted, the full file will be read.
+ pub fn with_range(self, range: Range<u64>) -> Self {
+ Self {
+ range: Some(range),
+ ..self
+ }
+ }
+
+ /// Specify a reader schema to use when reading the Avro file.
+ /// This can be useful to project specific columns or handle schema evolution.
+ /// If this is not provided, the schema will be derived from the Arrow schema provided.
+ pub fn with_reader_schema(self, reader_schema: AvroSchema) -> Self {
+ Self {
+ reader_schema: Some(reader_schema),
+ ..self
+ }
+ }
+
+ async fn read_header(&mut self) -> Result<(Header, u64), ArrowError> {
+ let mut decoder = HeaderDecoder::default();
+ let mut position = 0;
+ loop {
+ let range_to_fetch = position..(position + 64 * 1024).min(self.file_size);
Review Comment:
Not really. My usual files have a smaller header, but I figured 64 KiB is
small enough to be inconsequential for a fetch, and it will almost certainly
mean we don't have to run the loop more than once.
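
To illustrate the trade-off, here is a rough sketch of the ranges such a chunked header read would request. `header_fetch_ranges` and `HEADER_CHUNK` are hypothetical names that are not part of the PR, and the sketch assumes the header length is known up front, which the real `HeaderDecoder` only discovers while decoding.

```rust
use std::ops::Range;

/// Chunk size per header fetch (64 KiB), the same value as in the diff above.
const HEADER_CHUNK: u64 = 64 * 1024;

/// Hypothetical helper: lists the byte ranges a chunked header read would
/// request. `header_len` is an assumption for illustration only; a real
/// reader keeps fetching until its decoder reports that the header is done.
fn header_fetch_ranges(file_size: u64, header_len: u64) -> Vec<Range<u64>> {
    let mut ranges = Vec::new();
    let mut position = 0u64;
    // Stop once the header is fully covered or the file is exhausted.
    while position < header_len.min(file_size) {
        // Clamp the end of each fetch to the end of the file.
        let end = (position + HEADER_CHUNK).min(file_size);
        ranges.push(position..end);
        position = end;
    }
    ranges
}
```

Under these assumptions, a 4 KiB header in a 10 MiB file needs one fetch (`0..65536`), while a 100 KiB header needs a second round trip (`65536..131072`).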