tustvold commented on issue #2230:
URL: https://github.com/apache/arrow-rs/issues/2230#issuecomment-1200151568
Ok, so if I understand the issue correctly:

* You have a catalog service that identifies the files to scan for a query, along with some metadata
* These files are stored in parquet somewhere and can be fetched to memory

This is very similar to the problem IOx has, which it historically solved by not using DataFusion's parquet support and instead using the `parquet` crate directly. In particular it would:

* Fetch files to `Bytes` in memory or to a file on disk
* Use the `parquet` crate directly to scan these "files" from a custom `ExecutionPlan` (see the sketch at the end of this comment)

A while back I created some tickets related to making DataFusion's abstractions more flexible, but I've not yet had time to finish that work:

* https://github.com/apache/arrow-datafusion/issues/2291
* https://github.com/apache/arrow-datafusion/issues/2293

To me the issue here is that DataFusion's `ParquetExec` is very tightly coupled to both where the data is located and how it is fetched. There are two solutions to this in my mind:

* Continue the work to make DataFusion's abstractions more usable
* Accelerate plans to lift the object_store logic out of DataFusion and into `parquet`, so that it can be reused by custom `ExecutionPlan`s

What do you think? FYI @crepererum @alamb

> What do you think about modifying ObjectStore get operations to replace Path with ObjectMeta and adding custom_attributes: Option<Box<[u8]>> to ObjectMeta?

I think this is at odds with the crate's objective of providing a consistent abstraction across object stores, and so I would be extremely reluctant to change the API in this way.
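For concreteness, here is a minimal sketch of the "fetch to `Bytes`, then scan with the `parquet` crate" approach described above. This is illustrative only, not IOx's actual code: the function name is made up, and it assumes the `bytes` crate and the `parquet` crate with its `arrow` feature enabled.

```rust
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

/// Scan a parquet "file" that has already been fetched into memory,
/// e.g. by a catalog-aware service (hypothetical helper, for illustration).
fn scan_in_memory(data: Bytes) -> Result<(), Box<dyn std::error::Error>> {
    // `Bytes` implements `ChunkReader`, so the in-memory buffer can be
    // decoded exactly as if it were a file on disk
    let reader = ParquetRecordBatchReaderBuilder::try_new(data)?
        .with_batch_size(1024)
        .build()?;

    // Iterate the decoded Arrow RecordBatches; a custom ExecutionPlan
    // would yield these as its output stream instead of printing
    for batch in reader {
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```

How the bytes get fetched in the first place is then entirely up to the caller, which is what makes lifting the object_store logic out of DataFusion attractive.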
