yujun777 opened a new pull request, #63470:
URL: https://github.com/apache/doris/pull/63470
### What problem does this PR solve?
Hive external table row count estimation can read Hive Metastore (HMS) metadata
without filling Doris' Hive external metadata cache. As a result, a normal query
reads the same HMS metadata twice within a single planning flow.
The problematic path is:
1. A normal query asks the external table for row count through
`ExternalTable.getRowCount()`.
2. `ExternalRowCountCache` misses and calls
`HMSExternalTable.fetchRowCount()`.
3. If the HMS table parameters do not contain a row count and
`enable_get_row_count_from_file_list` is enabled, `HMSExternalTable` estimates
the row count from the file list.
4. Before this PR, that estimation path always used
`getAllPartitionsWithoutCache()` and `getFilesByPartitions(...,
withCache=false, ...)`.
5. Later in the same normal query, scan planning still needs partition and
file metadata, so it reads the same HMS/file metadata again through the normal
cached scan path.
The non-caching behavior was originally intended for non-query metadata display
requests such as `show table status`, `show stats`, and
`information_schema.tables`: those requests should not fill heavy Hive metadata
caches just to display a cached row count. Normal query planning is different.
Once row count estimation has fetched partition and file metadata, filling the
metadata cache avoids duplicate HMS reads in the scan planning step that follows.
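
The duplicated read can be illustrated with a small, self-contained model. This is only a sketch: `FakeMetastore`, `FakeMetaCache`, and `PlanningFlowSketch` are hypothetical names invented for illustration and are not Doris classes. The model tracks partition listing only and ignores file listing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Hive Metastore; it only counts remote reads.
class FakeMetastore {
    int partitionReads = 0;

    List<String> listPartitions(String table) {
        partitionReads++;
        return List.of("dt=2024-01-01", "dt=2024-01-02");
    }
}

// Hypothetical stand-in for the Hive external metadata cache.
class FakeMetaCache {
    private final FakeMetastore hms;
    private final Map<String, List<String>> partitions = new HashMap<>();

    FakeMetaCache(FakeMetastore hms) {
        this.hms = hms;
    }

    // Cached path: the first call loads from HMS, later calls hit the cache.
    List<String> getPartitions(String table) {
        return partitions.computeIfAbsent(table, hms::listPartitions);
    }

    // Non-cached path: always reads HMS and leaves the cache untouched.
    List<String> getPartitionsWithoutCache(String table) {
        return hms.listPartitions(table);
    }
}

class PlanningFlowSketch {
    // Returns the number of HMS partition reads for one query planning flow.
    static int planOnce(boolean fillMetaCache) {
        FakeMetastore hms = new FakeMetastore();
        FakeMetaCache cache = new FakeMetaCache(hms);

        // Step 1: row count estimation from the file list.
        List<String> forRowCount = fillMetaCache
                ? cache.getPartitions("t")
                : cache.getPartitionsWithoutCache("t");

        // Step 2: scan planning always goes through the cached path.
        List<String> forScan = cache.getPartitions("t");

        return hms.partitionReads;
    }
}
```

In this model, `PlanningFlowSketch.planOnce(false)` performs two metastore reads (the pre-PR behavior for a normal query), while `planOnce(true)` performs one, because scan planning finds the partitions already cached.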
### How was it fixed?
This PR separates the two row-count loading modes with an explicit
`fillMetaCache` flag:
- Normal query row-count loading uses `fillMetaCache=true`.
- Cached row-count display paths such as `getCachedRowCount()` still use
`fillMetaCache=false`.
- The default async row-count cache loader keeps the existing non-filling
behavior unless the caller explicitly requests cache filling.
- `HMSExternalTable` routes row-count file-list estimation through cached or
non-cached Hive metadata APIs based on `fillMetaCache`.
Concretely:
- `ExternalTable.getRowCount()` now requests
`ExternalRowCountCache.getCachedRowCount(..., true)`.
- `ExternalTable.getCachedRowCount()` and
`PluginDrivenExternalTable.getCachedRowCount()` request `false`.
- `ExternalRowCountCache` loads row count through
`ExternalTable.fetchRowCountWithMetaCache(fillMetaCache)`.
- `HMSExternalTable.fetchRowCount()` remains the lightweight non-filling
path.
- `HMSExternalTable.fetchRowCountWithMetaCache(true)` fills Hive
partition/file metadata cache while estimating row count from file list.
This keeps the previous optimization for show/stat display paths while
allowing normal queries to reuse metadata fetched during row-count estimation.
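
The routing described above can be sketched roughly as follows. The method names `getRowCount`, `getCachedRowCount`, `fetchRowCount`, and `fetchRowCountWithMetaCache` come from this PR description; everything else is an assumption made for illustration, including the `Sketch*` class names, the two-argument cache signature, the fixed estimate, and the synchronous cache (the real loader is described as async).

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, synchronous stand-in for ExternalRowCountCache.
class SketchRowCountCache {
    private final ConcurrentHashMap<String, Long> rowCounts = new ConcurrentHashMap<>();

    // Assumed signature: the real argument list is elided in the PR text.
    long getCachedRowCount(SketchExternalTable table, boolean fillMetaCache) {
        return rowCounts.computeIfAbsent(
                table.name, k -> table.fetchRowCountWithMetaCache(fillMetaCache));
    }
}

// Simplified model of an external table.
abstract class SketchExternalTable {
    final String name;
    final SketchRowCountCache rowCountCache;

    SketchExternalTable(String name, SketchRowCountCache rowCountCache) {
        this.name = name;
        this.rowCountCache = rowCountCache;
    }

    // Lightweight path that never fills the metadata cache.
    abstract long fetchRowCount();

    // Default: ignore the flag and keep the non-filling behavior.
    long fetchRowCountWithMetaCache(boolean fillMetaCache) {
        return fetchRowCount();
    }

    // Normal query path: allow the loader to fill the metadata cache.
    long getRowCount() {
        return rowCountCache.getCachedRowCount(this, true);
    }

    // Display path (show/stat): never fill the metadata cache.
    long getCachedRowCount() {
        return rowCountCache.getCachedRowCount(this, false);
    }
}

// Simplified model of an HMS table that honors the flag.
class SketchHmsTable extends SketchExternalTable {
    boolean metaCacheFilled = false;
    int loads = 0;

    SketchHmsTable(String name, SketchRowCountCache rowCountCache) {
        super(name, rowCountCache);
    }

    @Override
    long fetchRowCount() {
        return fetchRowCountWithMetaCache(false);
    }

    @Override
    long fetchRowCountWithMetaCache(boolean fillMetaCache) {
        loads++;
        if (fillMetaCache) {
            // Stands in for going through the cached partition/file APIs.
            metaCacheFilled = true;
        }
        return 100L; // hypothetical estimate, not a real computation
    }
}
```

Because the flag is chosen at the `ExternalTable` entry points, table types that do not override `fetchRowCountWithMetaCache` keep their existing behavior in this sketch, and only the HMS-like table reacts to `fillMetaCache=true`.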
### Check List
- Test:
- Unit Test: `ExternalRowCountCacheTest`
- Unit Test: `HMSExternalTableTest`
- Behavior changed: Yes. Normal query row-count estimation for HMS external
tables may now fill the Hive metadata cache when it has to estimate the row
count from the file list. Non-query cached row-count display paths still avoid
filling heavy metadata caches.
- Does this need documentation: No.