adriangb commented on issue #6002:
URL: https://github.com/apache/arrow-rs/issues/6002#issuecomment-2273419127
So here's what I've been working with:
```rust
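// Imports assumed from context; `ParquetResult` is taken to be an alias for
// `parquet::errors::Result`.
use std::sync::Arc;

use bytes::Bytes;
use parquet::arrow::async_reader::MetadataLoader;
use parquet::errors::Result as ParquetResult;
use parquet::file::metadata::ParquetMetaData;

/// Load parquet metadata, including the page index, from bytes.
/// This assumes the entire metadata (and no more) is in the provided bytes.
/// Although this method is async, no IO is performed.
pub async fn load_metadata(
    file_size: usize,
    serialized_parquet_metadata: Bytes,
) -> ParquetResult<Arc<ParquetMetaData>> {
    let metadata_length = serialized_parquet_metadata.len();
    // The provided bytes are treated as the tail of the file.
    let mut reader = MaskedBytes::new(
        Box::new(AsyncBytes::new(serialized_parquet_metadata)),
        file_size - metadata_length..file_size,
    );
    // First pass: decode the footer and file metadata.
    let metadata = MetadataLoader::load(&mut reader, file_size, None).await?;
    let loaded_metadata = metadata.finish();
    // Second pass: load the column index and offset index.
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}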
```
<details>
<summary>Supporting code</summary>
```rust
// Additional imports used by the supporting code (assumed from context).
use std::ops::Range;

use futures::future::BoxFuture;
use futures::FutureExt;
use parquet::arrow::async_reader::MetadataFetch;

/// Adapt a `Bytes` to a `MetadataFetch` implementation.
struct AsyncBytes {
    data: Bytes,
}

impl AsyncBytes {
    fn new(data: Bytes) -> Self {
        Self { data }
    }
}

impl MetadataFetch for AsyncBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        async move { Ok(self.data.slice(range.start..range.end)) }.boxed()
    }
}

/// A `MetadataFetch` implementation that reads from a subset of the full data
/// while accepting ranges that address the full data.
struct MaskedBytes {
    inner: Box<dyn MetadataFetch + Send>,
    inner_range: Range<usize>,
}

impl MaskedBytes {
    fn new(inner: Box<dyn MetadataFetch + Send>, inner_range: Range<usize>) -> Self {
        Self { inner, inner_range }
    }
}

impl MetadataFetch for &mut MaskedBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        // check that the range is within the metadata section
        let inner_range = self.inner_range.clone();
        if !(inner_range.start <= range.start && inner_range.end >= range.end) {
            return async move {
                let err = format!(
                    "Attempted to fetch data from outside metadata section: range={range:?}, available_range={inner_range:?}",
                );
                Err(parquet::errors::ParquetError::General(err))
            }
            .boxed();
        }
        // adjust the range to be within the data section
        let range = range.start - self.inner_range.start..range.end - self.inner_range.start;
        self.inner.fetch(range)
    }
}
```
</details>
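
For illustration, the range translation that `MaskedBytes::fetch` performs can be pulled out into a plain function. This is a minimal sketch using only `std`; `translate_range` is a hypothetical helper, not part of parquet, and it returns a `String` error instead of `ParquetError` to stay self-contained.

```rust
use std::ops::Range;

/// Translate a range addressed against the full file into a range addressed
/// against a buffer that only holds the bytes in `available`.
/// Hypothetical helper for illustration; not part of the parquet crate.
fn translate_range(
    requested: Range<usize>,
    available: &Range<usize>,
) -> Result<Range<usize>, String> {
    // Reject anything that is not fully contained in the bytes we hold.
    if requested.start < available.start || requested.end > available.end {
        return Err(format!(
            "range {requested:?} is outside the available range {available:?}"
        ));
    }
    // Shift both ends so that `available.start` maps to offset 0.
    Ok(requested.start - available.start..requested.end - available.start)
}
```

With a 1000-byte file whose final 100 bytes are held in memory, `translate_range(992..1000, &(900..1000))` yields `92..100`, while `translate_range(0..8, &(900..1000))` is rejected. When the buffer is the whole file, `available.start` is 0 and the translation is the identity.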
1. Sorry, I didn't fully understand the question. I think the API looks good on
the surface and, pending internal details, should work.
2. That offset adjustment would be 0 if you (1) have the whole file or (2) are
loading metadata dumped by #6197.
So maybe v0 of this API assumes it's one of those cases and doesn't adjust
offsets at all, but I'm open to alternatives.
3. As you point out, this might be hard to integrate with `MetadataLoader`,
because `MetadataLoader` is async and expects to be able to make many async
calls to load data. We'd have to do some pretty aggressive refactoring to
rework `MetadataLoader` into some sort of push-based parser, or build a
lower-level push-based parser that both `MetadataLoader` and
`ParquetMetaDataDecoder` can rely on.
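
To make the push-based idea in point 3 a bit more concrete, here is a rough sketch of what the lowest layer might look like, covering the footer only. `FooterDecoder` and `Decode` are hypothetical names and none of this is existing parquet API. The only thing it relies on is the Parquet file layout: the last 8 bytes are a 4-byte little-endian metadata length followed by the magic `PAR1`.

```rust
use std::ops::Range;

/// What the decoder needs next (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Decode {
    /// The caller should fetch this byte range of the file and push it.
    NeedData(Range<usize>),
    /// The serialized metadata occupies this byte range of the file.
    Done(Range<usize>),
}

/// Push-based footer decoder: it never performs IO itself.
struct FooterDecoder {
    file_size: usize,
    /// Metadata length in bytes, known once the footer has been pushed.
    metadata_len: Option<usize>,
}

impl FooterDecoder {
    fn new(file_size: usize) -> Result<Self, String> {
        if file_size < 8 {
            return Err(format!("file of {file_size} bytes cannot hold a footer"));
        }
        Ok(Self { file_size, metadata_len: None })
    }

    /// Ask the decoder what it needs; sync and async callers can both drive this.
    fn next_step(&self) -> Decode {
        let footer_start = self.file_size - 8;
        match self.metadata_len {
            None => Decode::NeedData(footer_start..self.file_size),
            Some(len) => Decode::Done(footer_start - len..footer_start),
        }
    }

    /// Push the 8 footer bytes requested by `next_step`.
    fn push_footer(&mut self, footer: &[u8]) -> Result<(), String> {
        if footer.len() != 8 || &footer[4..] != b"PAR1" {
            return Err("invalid parquet footer".to_string());
        }
        let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as usize;
        if len > self.file_size - 8 {
            return Err(format!("metadata length {len} exceeds file size"));
        }
        self.metadata_len = Some(len);
        Ok(())
    }
}
```

The point of this shape is that the decoder only reports which range it needs, so an async caller can await a fetch between steps and a sync caller can slice a buffer, with both sharing the same parsing logic. A real version would go on to decode the Thrift metadata and request page index ranges in the same way.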