adriangb commented on issue #6002:
URL: https://github.com/apache/arrow-rs/issues/6002#issuecomment-2219009292
I took a crack at using `MetadataLoader` since I happen to have all of the
parquet file bytes in memory when writing (although this is not necessarily the
case if you're streaming them somewhere).
My approach was to manually grab the footer, using the footer size declared
in the penultimate 4 bytes of the file (the 4 bytes just before the trailing
magic), and save that. But the metadata size declared in the footer does not
seem to include the Page Index, and I'm not sure how I'd calculate the start
location of the Page Index (and other things like bloom filters).
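To make the footer arithmetic concrete, here is a minimal sketch using only the standard library. It assumes an unencrypted file whose last 8 bytes are a 4-byte little-endian metadata length followed by the `PAR1` magic. The helper name `footer_metadata_range` is hypothetical and not part of the parquet crate.

```rust
use std::ops::Range;

/// Fixed trailer: 4-byte little-endian metadata length + 4-byte magic.
const TRAILER_LEN: usize = 8;
const MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical helper: returns the byte range of the serialized footer
/// metadata within `file`, or `None` if the trailer looks invalid.
fn footer_metadata_range(file: &[u8]) -> Option<Range<usize>> {
    if file.len() < TRAILER_LEN {
        return None;
    }
    let trailer = &file[file.len() - TRAILER_LEN..];
    // The final 4 bytes must be the magic.
    if trailer[4..] != MAGIC[..] {
        return None;
    }
    // The penultimate 4 bytes hold the metadata length.
    let len_bytes: [u8; 4] = trailer[..4].try_into().ok()?;
    let metadata_len = u32::from_le_bytes(len_bytes) as usize;
    let end = file.len() - TRAILER_LEN;
    // `checked_sub` rejects a declared length larger than the file.
    let start = end.checked_sub(metadata_len)?;
    Some(start..end)
}
```

The range computed this way covers only the footer metadata itself, which matches the observation that the declared size does not account for the Page Index.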
My implementation looks somewhat like:
```rust
// Import paths below are assumed from the parquet crate layout at the time
// of writing; adjust them for the version you are using.
use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::BoxFuture;
use futures::FutureExt;
use parquet::arrow::async_reader::{MetadataFetch, MetadataLoader};
use parquet::errors::Result as ParquetResult;
use parquet::file::footer::decode_metadata;
use parquet::file::metadata::ParquetMetaData;

#[derive(Debug, Clone)]
struct AsyncBytes {
    file_size: usize,
    inner: Bytes,
}

impl AsyncBytes {
    fn new(file_size: usize, inner: Bytes) -> Self {
        Self { file_size, inner }
    }
}

impl MetadataFetch for AsyncBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        // Check that the requested range lies within the bytes we hold,
        // i.e. the trailing `inner.len()` bytes of the file.
        let available_range = self.file_size - self.inner.len()..self.file_size;
        if !(available_range.start <= range.start && available_range.end >= range.end) {
            return async move {
                let err = format!(
                    "Attempted to fetch data from outside metadata section: range={:?}, available_range={:?}",
                    range, available_range
                );
                Err(parquet::errors::ParquetError::General(err))
            }
            .boxed();
        }
        // Translate the absolute file range into an offset within `inner`.
        let range = range.start - available_range.start..range.end - available_range.start;
        let data = self.inner.slice(range);
        async move { Ok(data) }.boxed()
    }
}

/// Load parquet metadata, including the page index, from bytes.
/// This assumes the entire metadata (and no more) is in the provided bytes.
/// Although this method is async, no IO is performed.
pub async fn load_metadata(
    file_size: usize,
    serialized_parquet_metadata: Bytes,
) -> ParquetResult<Arc<ParquetMetaData>> {
    let loaded_metadata = decode_metadata(&serialized_parquet_metadata)?;
    let reader = AsyncBytes::new(file_size, serialized_parquet_metadata);
    let mut metadata = MetadataLoader::new(reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}
```
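The offset translation inside `fetch` can also be pulled out into a small pure function, which makes it testable without any async machinery. This is only a sketch: `to_buffer_range` is a hypothetical name, and it assumes the buffer holds exactly the last `buffer_len` bytes of the file. Unlike the subtraction above, it uses `checked_sub` so that a buffer longer than the declared file size yields `None` rather than an underflow panic.

```rust
use std::ops::Range;

/// Hypothetical helper: map an absolute file range onto a buffer that holds
/// the last `buffer_len` bytes of a file of `file_size` bytes.
/// Returns `None` if the range is not fully covered by the buffer.
fn to_buffer_range(
    file_size: usize,
    buffer_len: usize,
    range: Range<usize>,
) -> Option<Range<usize>> {
    // Absolute file offset at which the buffered bytes begin.
    let buffer_start = file_size.checked_sub(buffer_len)?;
    if range.start > range.end || range.start < buffer_start || range.end > file_size {
        return None;
    }
    // Shift both ends so they are relative to the start of the buffer.
    Some(range.start - buffer_start..range.end - buffer_start)
}
```

For instance, with a 1000-byte file and the last 100 bytes buffered, a request for `950..960` maps to `50..60` in the buffer, while a request starting at offset 800 is rejected.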
Any advice or ideas would be appreciated.