nuno-faria commented on issue #15582: URL: https://github.com/apache/datafusion/issues/15582#issuecomment-3128489298
@alamb is this still something that would benefit upstream Datafusion? I've implemented the caching of Parquet metadata after noticing that a large amount of time spent on our workloads comes from (re)reading it, as pointed by `metadata_load_time`. This time becomes noticeable in large files that have a large number of pages, where most of the metadata reading times comes from retrieving the page indexes. Here are some results: - Simple reads ``` +----------------------------------------------------+ | 1k simple queries (select where k = ...) | +-----------+------------+---------+-----------------+ | # of Rows | Not cached | Cached | | +-----------+------------+---------+-----------------+ | 100k | 1.9447s | 1.1773s | 1.6517x faster | +-----------+------------+---------+-----------------+ | 100M | 21.2953s | 1.6018s | 13.2943x faster | +-----------+------------+---------+-----------------+ ``` - Partial explain of a `DataSourceExec`: ``` # not cached metrics=[ bytes_scanned=938381 metadata_load_time=12.697101ms time_elapsed_opening=13.0083ms time_elapsed_processing=13.1382ms ... ] # cached metrics=[ bytes_scanned=22488 metadata_load_time=35.801µs time_elapsed_opening=243.4µs time_elapsed_processing=817.2µs ... ] ``` - TPC-H - As for TPC-H, the improvements won't be as noticeable given the queries are more complex, but they will be there. Here are the `metadata_load_times` of a TPC-H query (`sf=10`): ``` # not cached [nation] metadata_load_time=338.501µs [orders] metadata_load_time=95.710812ms [supplier] metadata_load_time=1.478501ms [lineitem] metadata_load_time=424.488412ms [lineitem] metadata_load_time=383.579612ms [lineitem] metadata_load_time=461.110012ms # cached [nation] metadata_load_time=76.001µs [orders] metadata_load_time=15.823212ms [supplier] metadata_load_time=49.201µs [lineitem] metadata_load_time=906.312µs [lineitem] metadata_load_time=6.372212ms [lineitem] metadata_load_time=1.617712ms ``` As for the implementation, it relies on the session's `CacheManager` and a custom `CachedParquetFileReaderFactory`. The metadata caching can be set independently for each Parquet file, using `ParquetReadOptions`, and is invalidated when the underlying file is modified. I would be happy to open a PR, but I still have some questions: - For the `CacheManager` to remain generic, I created a `pub type FileMetadata = dyn Any + Send + Sync;` to represent metadata, which essentially can end up storing anything. Unlike the other information stored (`Statistics` and `ObjectMeta`), there isn't a common type for embedded metadata. Is this the right approach, or should the `CacheManager` be aware of `ParquetMetaData`? - Unlike the other `CacheManager` parameters, which I believe are exclusively user-provided, I think it would make sense for the metadata cache be populated with a `DefaultFilesMetadataCache`, so its easier to enable caching just with `ParquetReadOptions` or `set ...`. Does this make sense? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org