hudi-agent commented on code in PR #18726:
URL: https://github.com/apache/hudi/pull/18726#discussion_r3248083753


##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala:
##########
@@ -620,3 +624,54 @@ case class HoodiePostAnalysisRule(sparkSession: SparkSession) extends Rule[Logic
     }
   }
 }
+
+/**
+ * Stamps a synthesized [[CatalogTable]] (table name, base path, schema) onto path-based
+ * Hudi reads whose underlying file index is incremental or CDC. Without it, lineage and
+ * governance tooling sees `LogicalRelation.catalogTable = None` and falls back to the
+ * relation's class name as the dataset identifier -- useless for tracking which table
+ * an incremental query came from.
+ *
+ * Scope is intentionally limited to incremental and CDC reads:
+ *  - Catalog-registered reads already have `catalogTable` populated.
+ *  - Path-based snapshot reads have a working file-path-based fallback in existing
+ *    lineage tooling; changing their behavior is a separate decision.
+ */
+object HoodieIncrementalRelationIdentifier extends Rule[LogicalPlan] {

Review Comment:
   🤖 nit: the object handles both incremental and CDC reads (per 
`isIncrementalOrCDC`), but the name only mentions `Incremental`. Could you 
rename to something like `HoodieIncrementalAndCDCRelationIdentifier` (or just 
`HoodiePathBasedRelationIdentifier`) so future readers searching for CDC 
behavior find it here?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala:
##########
@@ -620,3 +624,54 @@ case class HoodiePostAnalysisRule(sparkSession: SparkSession) extends Rule[Logic
     }
   }
 }
+
+/**
+ * Stamps a synthesized [[CatalogTable]] (table name, base path, schema) onto path-based
+ * Hudi reads whose underlying file index is incremental or CDC. Without it, lineage and
+ * governance tooling sees `LogicalRelation.catalogTable = None` and falls back to the
+ * relation's class name as the dataset identifier -- useless for tracking which table
+ * an incremental query came from.
+ *
+ * Scope is intentionally limited to incremental and CDC reads:
+ *  - Catalog-registered reads already have `catalogTable` populated.
+ *  - Path-based snapshot reads have a working file-path-based fallback in existing
+ *    lineage tooling; changing their behavior is a separate decision.
+ */
+object HoodieIncrementalRelationIdentifier extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan =
+    AnalysisHelper.allowInvokingTransformsInAnalyzer {
+      plan transform {
+        // Type pattern + guard avoids destructuring `LogicalRelation`, whose case-class
+        // arity differs between Spark 3.x (4 args) and Spark 4.x (5 args). This rule
+        // lives in `hudi-spark`, which is compiled against every supported profile.
+        case lr: LogicalRelation
+            if lr.catalogTable.isEmpty
+              && lr.relation.isInstanceOf[HadoopFsRelation]
+              && isIncrementalOrCDC(lr.relation.asInstanceOf[HadoopFsRelation].location) =>
+          val fsRelation = lr.relation.asInstanceOf[HadoopFsRelation]

Review Comment:
   🤖 nit: `lr.relation.asInstanceOf[HadoopFsRelation]` is repeated three times 
across the guard, the rebinding, and the `metaClient` extraction. Could you 
bind it once (e.g. an `@` pattern or a `val` after the match) so the cast 
appears in only one place?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>
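
One way to bind the cast once is a small custom extractor that takes the whole `LogicalRelation` and returns the typed relation, so the rule still avoids destructuring the case class. The sketch below is only an illustration under stated assumptions: the types are simplified stand-ins rather than the real Spark classes, and `PathBasedFsRelation`, `stamp`, and the body of `isIncrementalOrCDC` are hypothetical names and logic that do not come from the PR.

```scala
object BindOnceSketch {
  // Stand-in types so the sketch compiles without Spark on the classpath.
  // They are NOT the real Spark classes; only the shape relevant here is modeled.
  sealed trait BaseRelation
  final case class HadoopFsRelation(location: String) extends BaseRelation
  final case class OtherRelation(name: String) extends BaseRelation
  final case class LogicalRelation(relation: BaseRelation, catalogTable: Option[String])

  // Hypothetical extractor: does the type test once and yields the typed relation.
  // It takes the whole LogicalRelation, so it never depends on the case-class arity.
  object PathBasedFsRelation {
    def unapply(lr: LogicalRelation): Option[HadoopFsRelation] = lr.relation match {
      case fs: HadoopFsRelation if lr.catalogTable.isEmpty => Some(fs)
      case _                                               => None
    }
  }

  // Placeholder for the PR's `isIncrementalOrCDC`; this prefix check is an
  // assumption made only so the sketch can run on its own.
  def isIncrementalOrCDC(location: String): Boolean =
    location.startsWith("incremental:") || location.startsWith("cdc:")

  // `rel` is the matched node and `fs` is the already-typed relation: no repeated casts.
  def stamp(lr: LogicalRelation): LogicalRelation = lr match {
    case rel @ PathBasedFsRelation(fs) if isIncrementalOrCDC(fs.location) =>
      rel.copy(catalogTable = Some(s"synthesized(${fs.location})"))
    case other => other
  }
}
```

In the actual rule, the same shape would presumably sit inside `plan transform { ... }`, with the extractor's guard replacing the three `asInstanceOf[HadoopFsRelation]` occurrences.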



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.