Re: [PR] [fix](multi-catalog) fixes some issues caused by the data_lake_reader refactoring #62306 and legacy issues [doris]

via GitHub Tue, 02 Jun 2026 18:00:55 -0700


github-actions[bot] commented on code in PR #62821:
URL: https://github.com/apache/doris/pull/62821#discussion_r3345327921



##########
be/src/exec/scan/file_scanner.cpp:
##########
@@ -1689,6 +1664,10 @@ Status FileScanner::_generate_partition_columns() {
     if (!range.__isset.columns_from_path_keys) {
         return Status::OK();
     }
+    DORIS_CHECK(range.__isset.columns_from_path);
+    DORIS_CHECK(range.__isset.columns_from_path_is_null);

Review Comment:
   `columns_from_path_is_null` is a newly-added optional Thrift field, but this 
makes it mandatory whenever `columns_from_path_keys` is present. During a 
rolling upgrade, an old FE still sends 
`columns_from_path_keys`/`columns_from_path` without this new field for 
Hive/Hudi/load scan ranges, and a new BE will hit this `DORIS_CHECK` and 
crash/fail the scanner before reading any data. The old code explicitly 
tolerated the field being absent, so please keep a compatibility branch here 
(and in the other BE extraction paths that added the same check) that treats 
missing null flags as all non-null or otherwise handles old FE scan ranges 
safely.



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/IcebergUtils.java:
##########
@@ -704,6 +705,57 @@ public static Map<String, String> 
getPartitionInfoMap(PartitionData partitionDat
         return partitionInfoMap;
     }
 
+    public static List<String> getIdentityPartitionColumns(Table table) {
+        LinkedHashSet<String> partitionColumns = new LinkedHashSet<>();
+        for (PartitionSpec spec : table.specs().values()) {
+            for (PartitionField partitionField : spec.fields()) {

Review Comment:
   This returns the union of identity partition columns across every Iceberg 
partition spec, but `IcebergScanNode` uses that list to classify slots for the 
whole scan while each split only sets `columns_from_path` for its own spec. If 
a table evolves from spec `identity(dt)` to `identity(dt), identity(region)`, 
files written under the old spec do not have a `region` partition value; 
nevertheless `region` is classified as `PARTITION_KEY`, so BE will not read 
`region` from the data file for those old files and will return a 
missing/default value instead of the stored column. Please make the 
classification safe for per-spec evolution, for example only treating a column 
as metadata-filled when every planned split/spec can provide that identity 
value, or keep such columns as regular file columns unless the split has the 
metadata value. Add a test with multiple Iceberg specs where an identity 
partition column is added after existing files.



##########
regression-test/suites/external_table_p0/paimon/test_paimon_partition_table.groovy:
##########
@@ -52,6 +52,74 @@ suite("test_paimon_partition_table", "p0,external") {
            String baseQueryName = "qt_show_partition_${tableName}"
            "$baseQueryName" """show partitions from ${tableName};"""
        }
+
+        sql """set force_jni_scanner=true"""
+
+        def salesByDateRows = sql """select sale_date from sales_by_date order 
by sale_date"""
+        assertEquals(

Review Comment:
   For deterministic query results, the regression-test standard in `AGENTS.md` 
asks to use `qt_*`/`order_qt_*` and generated `.out` files rather than Groovy 
`assertEquals`. These new fixed-result checks will not be reflected in the 
expected-output file and are inconsistent with the rest of this suite's 
`qt_show_partition_*` pattern. Please convert these result checks to ordered 
`qt` cases (or add a clear reason why assertion-based validation is necessary).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [fix](multi-catalog) fixes some issues caused by the data_lake_reader refactoring #62306 and legacy issues [doris]

Reply via email to