nuno-faria commented on issue #15582:
URL: https://github.com/apache/datafusion/issues/15582#issuecomment-3128489298

   @alamb is this still something that would benefit upstream Datafusion?
   
   I've implemented the caching of Parquet metadata after noticing that a large 
amount of time spent on our workloads comes from (re)reading it, as pointed by 
`metadata_load_time`. This time becomes noticeable in large files that have a 
large number of pages, where most of the metadata reading times comes from 
retrieving the page indexes.
   
   Here are some results:
   
   - Simple reads
     ```
     +----------------------------------------------------+
     |      1k simple queries (select where k = ...)      |
     +-----------+------------+---------+-----------------+
     | # of Rows | Not cached |  Cached |                 |
     +-----------+------------+---------+-----------------+
     |    100k   |   1.9447s  | 1.1773s |  1.6517x faster |
     +-----------+------------+---------+-----------------+
     |    100M   |  21.2953s  | 1.6018s | 13.2943x faster |
     +-----------+------------+---------+-----------------+
     ```
   
   - Partial explain of a `DataSourceExec`:
     ```
     # not cached
     metrics=[
         bytes_scanned=938381
         metadata_load_time=12.697101ms
         time_elapsed_opening=13.0083ms
         time_elapsed_processing=13.1382ms
         ...
     ]
     
     # cached
     metrics=[
         bytes_scanned=22488
         metadata_load_time=35.801µs
         time_elapsed_opening=243.4µs
         time_elapsed_processing=817.2µs
         ...
     ]
     ```
   
   - TPC-H
     - As for TPC-H, the improvements won't be as noticeable given the queries 
are more complex, but they will be there. Here are the `metadata_load_times` of 
a TPC-H query (`sf=10`):
       ```
       # not cached
       [nation]    metadata_load_time=338.501µs
       [orders]    metadata_load_time=95.710812ms
       [supplier]  metadata_load_time=1.478501ms
       [lineitem]  metadata_load_time=424.488412ms
       [lineitem]  metadata_load_time=383.579612ms
       [lineitem]  metadata_load_time=461.110012ms
   
       # cached
       [nation]    metadata_load_time=76.001µs
       [orders]    metadata_load_time=15.823212ms
       [supplier]  metadata_load_time=49.201µs
       [lineitem]  metadata_load_time=906.312µs
       [lineitem]  metadata_load_time=6.372212ms
       [lineitem]  metadata_load_time=1.617712ms
       ```
   
   As for the implementation, it relies on the session's `CacheManager` and a 
custom `CachedParquetFileReaderFactory`. The metadata caching can be set 
independently for each Parquet file, using `ParquetReadOptions`, and is 
invalidated when the underlying file is modified. I would be happy to open a 
PR, but I still have some questions:
     - For the `CacheManager` to remain generic, I created a `pub type 
FileMetadata = dyn Any + Send + Sync;` to represent metadata, which essentially 
can end up storing anything. Unlike the other information stored (`Statistics` 
and `ObjectMeta`), there isn't a common type for embedded metadata. Is this the 
right approach, or should the `CacheManager` be aware of `ParquetMetaData`?
     - Unlike the other `CacheManager` parameters, which I believe are 
exclusively user-provided, I think it would make sense for the metadata cache 
be populated with a `DefaultFilesMetadataCache`, so its easier to enable 
caching just with `ParquetReadOptions` or `set ...`. Does this make sense?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to