Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24041 )

Change subject: IMPALA-14583: Support partial RPC dispatch for Iceberg tables
......................................................................


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/24041/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java
File fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java:

http://gerrit.cloudera.org:8080/#/c/24041/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@149
PS1, Line 149:      * @param startOffset Index to start from (0-based)
nit: create ret only after fast return path


http://gerrit.cloudera.org:8080/#/c/24041/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@159
PS1, Line 159:
> Both valid points. But given that Csaba's fast past optimization cuts off t
Maybe this could be thought through in a follow up patch.
Having a LinkedHashMap for fileDescMap_ instead of a separate list and map 
could make this a lot simpler, but my understanding is that would lead to and 
extra reference per file.


http://gerrit.cloudera.org:8080/#/c/24041/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@175
PS1, Line 175:       // Iterate through the list range
Isn't ret.size() always endOffset - startOffset? if yes, it would be enough to 
return the map


http://gerrit.cloudera.org:8080/#/c/24041/2/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java
File fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java:

http://gerrit.cloudera.org:8080/#/c/24041/2/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java@1230
PS2, Line 1230: ithRetry
With pagination?


http://gerrit.cloudera.org:8080/#/c/24041/2/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java@1248
PS2, Line 1248:  
resp.table_info.iceberg_table.content_files.isSetTotal_file_count()
It could be treated as an exception if total_file_count is not returned on the 
first call.


http://gerrit.cloudera.org:8080/#/c/24041/2/tests/custom_cluster/test_local_catalog.py
File tests/custom_cluster/test_local_catalog.py:

http://gerrit.cloudera.org:8080/#/c/24041/2/tests/custom_cluster/test_local_catalog.py@791
PS2, Line 791:     for i in range(150):
             :       self.execute_query("insert into {0} values ({1}, 
'data{1}')".format(tbl, i))
How much time does this take? My guess is 150 DMLs would be pretty slow - the 
test could use less files and smaller catalog_partial_fetch_max_files


http://gerrit.cloudera.org:8080/#/c/24041/2/tests/custom_cluster/test_local_catalog.py@798
PS2, Line 798:     # Extract query_id from SHOW FILES to precisely check only 
its log messages
             :     query_id = re.search(r"Query \(id=(.*)\):", 
res.runtime_profile).group(1)
the query id is not really needed as the database itself will be unique


http://gerrit.cloudera.org:8080/#/c/24041/2/tests/custom_cluster/test_local_catalog.py@801
PS2, Line 801:     # Check catalogd logged the truncation warning for final 
SHOW FILES request
             :     # (limit=100, so returns 100/150 on first request with 
offset 0)
             :     self.assert_catalogd_log_contains(
             :         "WARNING", r"{0}\] Returning 100/150 files for Iceberg 
table {1} \(offset 0\). "
             :         "Coordinator will fetch remaining files in follow-up 
requests.".format(
             :             query_id, tbl))
             :
             :     # Check coordinator logged the intermediate fetch
             :     # (request 1 - the initial request returns 100)
             :     self.assert_impalad_log_contains(
             :         "INFO",
             :         r"{0}\] Fetched 100 files for Iceberg table TableMetaRef 
{1}@\d+ "
             :         r"in request 1. Sending follow-up request for more 
files.".format(
             :             query_id, tbl))
             :
             :     # Check coordinator logged fetching all files in multiple 
requests
             :     # TableMetaRef format: "TableMetaRef db.table@version"
             :     self.assert_impalad_log_contains(
             :         "INFO",
             :         r"{0}\] Fetched 150 total files for Iceberg table 
TableMetaRef "
             :         r"{1}@\d+ across 2 requests".format(query_id, tbl))
Could have a utility function for this to use in both tests



--
To view, visit http://gerrit.cloudera.org:8080/24041
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7f2c058b7cc8efc15bac9fe0e91baadbb7b92cbb
Gerrit-Change-Number: 24041
Gerrit-PatchSet: 2
Gerrit-Owner: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 09 Mar 2026 07:59:53 +0000
Gerrit-HasComments: Yes

Reply via email to