alamb commented on code in PR #9548:
URL: https://github.com/apache/arrow-datafusion/pull/9548#discussion_r1519644955
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -540,6 +541,8 @@ async fn fetch_statistics(
pub struct ParquetSink {
/// Config options for writing data
config: FileSinkConfig,
+ /// File metadata from successfully produced parquet files.
Review Comment:
To use this API in a more general context, we would probably need to define
how these `FileMetaData` mapped to the original files
Maybe it could be something like this (which would also support the API
below)
```rust
HashMap<Path, FileMetaData>
```
##########
datafusion/execution/src/object_store.rs:
##########
@@ -60,6 +60,11 @@ impl ObjectStoreUrl {
pub fn as_str(&self) -> &str {
self.as_ref()
}
+
+ /// Returns as Url
+ pub fn as_url(&self) -> &Url {
Review Comment:
What is the accessor? `AsRef<str>`?
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -563,13 +566,22 @@ impl DisplayAs for ParquetSink {
impl ParquetSink {
/// Create from config.
pub fn new(config: FileSinkConfig) -> Self {
- Self { config }
+ Self {
+ config,
+ written: Default::default(),
+ }
}
/// Retrieve the inner [`FileSinkConfig`].
pub fn config(&self) -> &FileSinkConfig {
&self.config
}
+
+ /// Retrieve the file metadata from the last write.
+ pub fn written(&self) -> Vec<FileMetaData> {
+ self.written.lock().clone()
+ }
Review Comment:
I think the comments here are out of date (it returns the FileMetaData
written for *all* written files, not just the last one)
As mentioned above, without any additional information I don't think there
is any way to know how to match up the written files and the entries in the
`Vec<FileMetaData>` without some more information
However, if we changed the API to return a HashMap I think it would work well
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]