yujun777 opened a new pull request, #63470:
URL: https://github.com/apache/doris/pull/63470

   ### What problem does this PR solve?
   
   Row-count estimation for Hive external tables can read Hive Metastore (HMS) 
metadata without filling Doris' Hive external metadata cache. As a result, a 
normal query fetches the same HMS metadata twice within one planning flow.
   
   The problematic path is:
   
   1. A normal query asks the external table for row count through 
`ExternalTable.getRowCount()`.
   2. `ExternalRowCountCache` misses and calls 
`HMSExternalTable.fetchRowCount()`.
   3. If the HMS table parameters do not contain a row count and 
`enable_get_row_count_from_file_list` is enabled, `HMSExternalTable` estimates 
the row count from the file list.
   4. Before this PR, that estimation path always used 
`getAllPartitionsWithoutCache()` and `getFilesByPartitions(..., 
withCache=false, ...)`.
   5. Later in the same normal query, scan planning still needs partition and 
file metadata, so it reads the same HMS/file metadata again through the normal 
cached scan path.
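   
   The steps above can be illustrated with a minimal, self-contained sketch. 
It is not Doris code: `DuplicateReadSketch`, `listPartitionsRemote`, 
`listPartitionsCached`, and the 1000-rows-per-partition estimate are all 
hypothetical stand-ins, used only to show why an uncached estimation step 
followed by a cached scan-planning step costs two remote reads.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Hive metadata layer; none of these names are Doris classes.
class DuplicateReadSketch {
    static int remoteReads = 0; // counts simulated HMS round trips
    static final Map<String, List<String>> cache = new HashMap<>();

    // Simulated remote call: every invocation costs one HMS round trip.
    static List<String> listPartitionsRemote(String table) {
        remoteReads++;
        return List.of("dt=2024-01-01", "dt=2024-01-02");
    }

    // Cached access, as the scan-planning step would use it.
    static List<String> listPartitionsCached(String table) {
        return cache.computeIfAbsent(table, DuplicateReadSketch::listPartitionsRemote);
    }

    // Row-count estimation modeled on the pre-PR path: it bypasses the cache entirely.
    static long estimateRowCountWithoutCache(String table) {
        return listPartitionsRemote(table).size() * 1000L; // per-partition row count is made up
    }

    // Runs one simulated planning flow and returns the number of remote reads it needed.
    static int planQuery(String table) {
        remoteReads = 0;
        cache.clear();
        estimateRowCountWithoutCache(table); // step 4: uncached read, cache stays empty
        listPartitionsCached(table);         // step 5: cache miss, same metadata read again
        return remoteReads;
    }

    public static void main(String[] args) {
        System.out.println("remote reads in one planning flow: " + planQuery("db.tbl"));
    }
}
```

   In this toy model, `planQuery` reports two remote reads for a single 
query, which is the duplication this PR targets.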
   
   This behavior was originally useful for non-query metadata display requests 
such as `show table status`, `show stats`, and `information_schema.tables`: 
those requests should not fill heavy Hive metadata caches just because they 
display cached row count. However, normal query planning is different. If row 
count estimation has already fetched partition and file metadata, filling the 
metadata cache avoids duplicated HMS reads in the following scan planning step.
   
   ### How was it fixed?
   
   This PR separates the two row-count loading modes with an explicit 
`fillMetaCache` flag:
   
   - Normal query row-count loading uses `fillMetaCache=true`.
   - Cached row-count display paths such as `getCachedRowCount()` still use 
`fillMetaCache=false`.
   - The default async row-count cache loader keeps the existing non-filling 
behavior unless the caller explicitly requests cache filling.
   - `HMSExternalTable` routes row-count file-list estimation through cached or 
non-cached Hive metadata APIs based on `fillMetaCache`.
   
   Concretely:
   
   - `ExternalTable.getRowCount()` now requests 
`ExternalRowCountCache.getCachedRowCount(..., true)`.
   - `ExternalTable.getCachedRowCount()` and 
`PluginDrivenExternalTable.getCachedRowCount()` request `false`.
   - `ExternalRowCountCache` loads row count through 
`ExternalTable.fetchRowCountWithMetaCache(fillMetaCache)`.
   - `HMSExternalTable.fetchRowCount()` remains the lightweight non-filling 
path.
   - `HMSExternalTable.fetchRowCountWithMetaCache(true)` fills Hive 
partition/file metadata cache while estimating row count from file list.
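   
   As a rough model of the routing described above, the sketch below shows a 
boolean flag choosing between a cache-filling and a non-filling read. Only the 
names `fillMetaCache` and `fetchRowCountWithMetaCache` come from this PR; the 
class, the helper methods, the signature used here, and the row-count 
arithmetic are assumptions made for illustration and do not mirror the real 
`HMSExternalTable` implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of flag-based routing; not the actual HMSExternalTable code.
class FillMetaCacheSketch {
    static int remoteReads = 0; // counts simulated remote file listings
    static final Map<String, List<String>> metaCache = new HashMap<>();

    // Simulated remote file listing: every invocation costs one remote read.
    static List<String> listFilesRemote(String table) {
        remoteReads++;
        return List.of("part-0.parquet", "part-1.parquet", "part-2.parquet");
    }

    // Cached file listing, shared with the scan-planning step.
    static List<String> listFilesCached(String table) {
        return metaCache.computeIfAbsent(table, FillMetaCacheSketch::listFilesRemote);
    }

    // true: read through the cache (filling it); false: read remotely and leave the cache untouched.
    static long fetchRowCountWithMetaCache(String table, boolean fillMetaCache) {
        List<String> files = fillMetaCache ? listFilesCached(table) : listFilesRemote(table);
        return files.size() * 1000L; // rows per file is made up
    }

    // Simulates one flow: row-count loading, optionally followed by scan planning.
    static int remoteReadsForFlow(boolean fillMetaCache, boolean thenPlanScan) {
        remoteReads = 0;
        metaCache.clear();
        fetchRowCountWithMetaCache("db.tbl", fillMetaCache);
        if (thenPlanScan) {
            listFilesCached("db.tbl");
        }
        return remoteReads;
    }

    public static void main(String[] args) {
        System.out.println("query, fillMetaCache=true:  " + remoteReadsForFlow(true, true));
        System.out.println("query, fillMetaCache=false: " + remoteReadsForFlow(false, true));
        System.out.println("display only, no fill:      " + remoteReadsForFlow(false, false));
    }
}
```

   In this model the query flow needs one remote read with the flag set and 
two without it, while the display-only flow leaves `metaCache` empty.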
   
   This keeps the previous optimization for show/stat display paths while 
allowing normal queries to reuse metadata fetched during row-count estimation.
   
   ### Check List
   
   - Test:
     - Unit Test: `ExternalRowCountCacheTest`
     - Unit Test: `HMSExternalTableTest`
   - Behavior changed: Yes. Normal query row-count estimation for HMS external 
tables may now fill the Hive metadata cache when it has to estimate the row 
count from the file list. Non-query cached row-count display paths still avoid 
filling heavy metadata caches.
   - Does this need documentation: No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

