LuciferYang opened a new pull request #33748:
URL: https://github.com/apache/spark/pull/33748


   ### What changes were proposed in this pull request?
   This PR introduces a File Meta Cache mechanism for Spark SQL and provides 
a basic implementation of it for ORC.
   
   The main changes in this PR are as follows:
   
   - Defines a `FileMetaCacheManager` that caches the mapping from 
`FileMetaKey` to `FileMeta`. `FileMetaKey` is the cache key; by default its 
`equals` is determined by the file path. `FileMeta` represents the cached 
value and is generated by the `FileMetaKey#getFileMeta` method.
   
   - Currently, `FileMetaCacheManager` supports a simple expiration-based 
eviction mechanism; the expiration time is determined by the new config 
`FILE_META_CACHE_TTL_SINCE_LAST_ACCESS`.
   
   - For the ORC file format, this PR adds `OrcFileMetaKey` and `OrcFileMeta` 
to cache the ORC file tail. The cached tail can be used by vectorized reads 
in both DS API V1 and V2. The feature is enabled when 
`FILE_META_CACHE_ORC_ENABLED` is true.
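   
   The caching behaviour described in the list above can be sketched roughly 
as follows. This is only an illustration, not the code in this PR: the class 
names mirror the description, while the method bodies, the injected clock, the 
`path()` accessor, and the lazy check-on-access expiry are all assumptions 
made to keep the example small and self-contained.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Illustrative only: names mirror the PR description, bodies are assumptions. */
public class FileMetaCacheSketch {

    /** Marker for a cached value, e.g. a parsed ORC file tail. */
    public interface FileMeta {}

    /** Cache key whose equality is decided by the file path alone. */
    public abstract static class FileMetaKey {
        private final String path;

        public FileMetaKey(String path) { this.path = path; }

        public String path() { return path; }

        /** Reads the metadata from storage; called only on a miss or after expiry. */
        public abstract FileMeta getFileMeta();

        @Override public boolean equals(Object other) {
            return other instanceof FileMetaKey && path.equals(((FileMetaKey) other).path);
        }

        @Override public int hashCode() { return path.hashCode(); }
    }

    /** Cache with a time-to-live measured from the last access of each entry. */
    public static final class FileMetaCacheManager {
        private static final class Entry {
            final FileMeta meta;
            final long lastAccessSeconds;

            Entry(FileMeta meta, long lastAccessSeconds) {
                this.meta = meta;
                this.lastAccessSeconds = lastAccessSeconds;
            }
        }

        private final ConcurrentHashMap<FileMetaKey, Entry> entries = new ConcurrentHashMap<>();
        private final long ttlSeconds;
        // Assumption: the clock is injected so that expiry can be tested deterministically.
        private final LongSupplier clockSeconds;

        public FileMetaCacheManager(long ttlSeconds, LongSupplier clockSeconds) {
            this.ttlSeconds = ttlSeconds;
            this.clockSeconds = clockSeconds;
        }

        /** Returns the cached meta for the key, loading it on a miss or after expiry. */
        public FileMeta get(FileMetaKey key) {
            long now = clockSeconds.getAsLong();
            Entry entry = entries.compute(key, (k, old) -> {
                boolean expired = old == null || now - old.lastAccessSeconds >= ttlSeconds;
                // Reload on a miss or after expiry; otherwise only refresh the access time.
                return new Entry(expired ? k.getFileMeta() : old.meta, now);
            });
            return entry.meta;
        }
    }
}
```

   In this sketch, expired entries are replaced only when they are next 
requested and are never removed in the background, and the load runs inside 
`compute`, which blocks concurrent updates to the same key. A production cache 
would more likely delegate eviction to a caching library.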
   
   Currently, the file meta cache cannot be used by `RowBasedReader`; 
supporting it requires the completion of 
[ORC-746](https://issues.apache.org/jira/browse/ORC-746).
   
   ### Why are the changes needed?
   Allow the ORC data source to use the File Meta Cache mechanism to reduce 
the number of metadata reads when multiple queries are performed on the same 
dataset.
   
   ### Does this PR introduce _any_ user-facing change?
   Adds 2 new configs:
   
   - `FILE_META_CACHE_ORC_ENABLED` (`spark.sql.fileMetaCache.orc.enabled`): 
whether to enable the ORC file meta cache mechanism.
   - `FILE_META_CACHE_TTL_SINCE_LAST_ACCESS` 
(`spark.sql.fileMetaCache.ttlSinceLastAccess`): the time-to-live of a file 
metadata cache entry after its last access, in seconds.
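   
   As a configuration fragment, the two settings might be applied when 
building a session, as below. This is a hedged sketch: it assumes a Spark 
build that includes this PR, the TTL value of 3600 and the path 
`/tmp/events_orc` are hypothetical, and whether these configs can also be 
changed at runtime is not stated here.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OrcFileMetaCacheConfigExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("orc-file-meta-cache-example")
            .master("local[*]")
            // Config keys as named in this PR description.
            .config("spark.sql.fileMetaCache.orc.enabled", "true")
            // Hypothetical value: keep entries for one hour after their last access.
            .config("spark.sql.fileMetaCache.ttlSinceLastAccess", "3600")
            .getOrCreate();

        // Hypothetical ORC dataset queried twice; with the cache enabled, the
        // second scan is the case where cached file tails are meant to be reused.
        Dataset<Row> events = spark.read().orc("/tmp/events_orc");
        events.count();
        events.count();

        spark.stop();
    }
}
```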
   
   ### How was this patch tested?
   
   - Pass the Jenkins or GitHub Actions checks
   - Added new tests to `OrcQuerySuite`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


