[GitHub] spark issue #15539: [SPARK-17994] [SQL] Add back a file status cache for cat...

cloud-fan Fri, 21 Oct 2016 06:41:13 -0700

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/15539
  
    We have REFRESH TABLE/PATH because we cache things, so I think we should 
consider caching and refreshing together. Currently we have 4 caches:
    
    1. **table name to `LogicalRelation` cache**: It's kept in 
`HiveMetastoreCatalog`, we won't touch it in this PR.
    2. **table name to partition directories**: It's kept in 
`ListingFileCatalog` and is data source table only, as hive table persists this 
information in metastore. #15515 is going to follow hive and remove this cache 
for data source table.
    3. **partition directory to file status**: It's also kept in 
`ListingFileCatalog` and is data source table only. This PR is going to add it 
to data source tables that are converted from hive tables, and make it global.
    4. **table relation to `InMemoryRelation`**: It's kept in `CacheManager`, 
and will only be cached when users call `CACHE TABLE`. We won't touch it in 
this PR.
    
    The current refresh logic:
    
    1. **REFRESH TABLE**: it invalidates all 4 caches, by removing the first 
cache, as `FileCatalog` is a member of `HadoopFsRelation` and 
`HadoopFsRelation` is a member of `LogicalRelation`. Then it asks `CacheManage` 
to invalidate the `InMemoryRelation` cache if it's cached.
    2. **REFRESH PATH**: it finds a table that uses the given path as table 
path or partition paths, and then refresh this table. Actually I don't think 
this is the right behaviour, REFRESH PATH should be more fine-grained, and only 
remove this path from the third cache.
    
    However, if we make the third cache global, we need a little more work to 
make REFRESH TABLE work, because just removing the first cache is not enough. 
We need a way to know which partition directories are cached for the given 
table.
    
    Proposal 1(the current implementation of this PR): In the global "partition 
directory to file status" cache, we use `clientId -> partition dir` as key 
instead of `partition dir`. When we refresh a table, we go through the whole 
map and find cache entries match the `clientId` and remove them. Note that, 
every `TableFileCatalog` has a unique `clientId`.
    
    The biggest problem of this proposal is, invalidating the cache may be slow 
if there are a lot of cache entries.
    
    Proposal 2: we just use `partition dir` as the cache entry key in the 
global cache, and keep the list of cached partition dirs in `TableFileCatalog`. 
When refreshing a table, we get the list of cached partition dir first, and 
then invalidete them one by one in the "partition directory to file status" 
cache.
    
    The biggest problem of this proposal is, if 2 tables share some paths(it's 
possible if we use same paths to add parittions to 2 tables), refreshing one 
table will partially refresh the other table. What is worse, if the other table 
is explicitly cached via CACHE TABLE, we won't invalidate the 
`InMemoryRelation` cache and brings inconsistency.
    
    Proposal 3: do not make the "partition directory to file status" cache 
global in this PR, and think about it later. At least it's not worse than spark 
2.0
    
    Any ideas?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark issue #15539: [SPARK-17994] [SQL] Add back a file status cache for cat...

Reply via email to