Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/15539
We have REFRESH TABLE/PATH because we cache things, so I think we should
consider caching and refreshing together. Currently we have 4 caches:
1. **table name to `LogicalRelation` cache**: It's kept in
`HiveMetastoreCatalog`, we won't touch it in this PR.
2. **table name to partition directories**: It's kept in
`ListingFileCatalog` and is data source table only, as hive table persists this
information in metastore. #15515 is going to follow hive and remove this cache
for data source table.
3. **partition directory to file status**: It's also kept in
`ListingFileCatalog` and is data source table only. This PR is going to add it
to data source tables that are converted from hive tables, and make it global.
4. **table relation to `InMemoryRelation`**: It's kept in `CacheManager`,
and will only be cached when users call `CACHE TABLE`. We won't touch it in
this PR.
The current refresh logic:
1. **REFRESH TABLE**: it invalidates all 4 caches, by removing the first
cache, as `FileCatalog` is a member of `HadoopFsRelation` and
`HadoopFsRelation` is a member of `LogicalRelation`. Then it asks `CacheManage`
to invalidate the `InMemoryRelation` cache if it's cached.
2. **REFRESH PATH**: it finds a table that uses the given path as table
path or partition paths, and then refresh this table. Actually I don't think
this is the right behaviour, REFRESH PATH should be more fine-grained, and only
remove this path from the third cache.
However, if we make the third cache global, we need a little more work to
make REFRESH TABLE work, because just removing the first cache is not enough.
We need a way to know which partition directories are cached for the given
table.
Proposal 1(the current implementation of this PR): In the global "partition
directory to file status" cache, we use `clientId -> partition dir` as key
instead of `partition dir`. When we refresh a table, we go through the whole
map and find cache entries match the `clientId` and remove them. Note that,
every `TableFileCatalog` has a unique `clientId`.
The biggest problem of this proposal is, invalidating the cache may be slow
if there are a lot of cache entries.
Proposal 2: we just use `partition dir` as the cache entry key in the
global cache, and keep the list of cached partition dirs in `TableFileCatalog`.
When refreshing a table, we get the list of cached partition dir first, and
then invalidete them one by one in the "partition directory to file status"
cache.
The biggest problem of this proposal is, if 2 tables share some paths(it's
possible if we use same paths to add parittions to 2 tables), refreshing one
table will partially refresh the other table. What is worse, if the other table
is explicitly cached via CACHE TABLE, we won't invalidate the
`InMemoryRelation` cache and brings inconsistency.
Proposal 3: do not make the "partition directory to file status" cache
global in this PR, and think about it later. At least it's not worse than spark
2.0
Any ideas?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]