codope opened a new pull request, #18726:
URL: https://github.com/apache/hudi/pull/18726

   ### Describe the issue this Pull Request addresses
   
   Today, a downstream consumer that walks an analyzed Spark plan to extract 
column lineage hits two dead ends on Hudi:
   
   - **Hudi MERGE is opaque.** `MergeIntoHoodieTableCommand` extends 
`HoodieLeafRunnableCommand`, which forces `children = Nil`. That is the right 
call for the Catalyst optimizer (we don't want generic optimizer rules 
rewriting the source/target subtrees out from under our custom resolution), but 
it also hides the source plan, target plan, and merge condition from every plan 
walker. Lineage tooling therefore sees a leaf with no inputs and reports the 
merge as having no upstream columns. The other Hudi write commands (`Insert`, `Update`, 
`Delete`, `CTAS`) were already migrated to `DataWritingCommand`/`UnaryCommand` 
in #12772, which exposes the source plan via standard `children`, so MERGE is 
the lone gap on master.
   - **Path-based incremental / CDC reads are anonymous.** When a user runs 
`spark.read.format("hudi").option("hoodie.datasource.query.type", 
"incremental").load(path)`, the resulting `LogicalRelation.catalogTable` is 
`None`. Lineage tooling then falls back to the relation's class name 
("HadoopFsRelation") as the dataset identifier, which collides across every 
Hudi incremental read in the job and is useless for tracking which table a 
query came from.
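
To make the two dead ends concrete, here is a minimal sketch of the kind of plan walker described above, written against Spark's Catalyst API. The helper name `datasetNames` and the class-name fallback are illustrative assumptions, not code taken from any particular lineage tool.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Hypothetical lineage helper: one dataset name per relation in the plan.
// `collect` only follows `children`, so a leaf command with
// `children = Nil` contributes no inputs at all.
def datasetNames(plan: LogicalPlan): Seq[String] = plan.collect {
  case lr: LogicalRelation =>
    lr.catalogTable
      .map(_.identifier.unquotedString)
      // Assumed fallback when no catalog entry exists: the relation's
      // class name, which is identical for every path-based read.
      .getOrElse(lr.relation.getClass.getSimpleName)
}
```

Under these assumptions, a plan rooted at a leaf command yields an empty list, and every path-based relation without a `catalogTable` reports the same class name.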
   
   ### Summary and Changelog
   
   Make Hudi's analyzed plans introspectable to lineage tooling that walks 
Catalyst plans (e.g. OpenLineage Spark integration, Atlas, custom analyzer 
extensions, etc.). Two changes:
   
   1. **`MergeIntoHoodieTableCommand.innerChildren`**: expose the analyzed 
`MergeIntoTable` (source plan, target plan, merge condition, matched / 
not-matched actions) without breaking the optimizer leaf contract.
   2. **`HoodieIncrementalRelationIdentifier`**: post-hoc analyzer rule that 
stamps a synthesized `CatalogTable` (table name, base path, schema) onto 
`LogicalRelation`s backed by `HoodieIncrementalFileIndex` / 
`HoodieCDCFileIndex` when the read entered via path (i.e. no catalog 
registration).
   
   Tracking issue: apache/hudi#18298
   
   ### Impact
   
   #### `innerChildren` for MERGE
   
   `TreeNode.innerChildren` is Spark's documented escape hatch for plan nodes 
that need to expose subtrees for display/inspection without participating in 
optimizer traversal. It is *not* visited by `transform` / `mapChildren`, so 
there is no risk of generic Catalyst rules re-writing the inner 
`MergeIntoTable`. It *is* rendered by `EXPLAIN` and is the conventional access 
point for plan walkers. This is the same pattern Spark itself uses for 
`WriteToDataSourceV2Exec` and other commands that wrap a logical plan they 
don't want optimized as a child. The `HoodieLeafLike#children = Nil` contract 
is preserved. This change is a purely additive method override.
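
As a rough illustration of the pattern, the sketch below uses a hypothetical `WrappingCommand` (a stand-in, not the actual `MergeIntoHoodieTableCommand`) to show how a leaf command can keep `children` empty while still exposing a wrapped plan, and how a walker can opt in to `innerChildren`. It assumes Spark 3.2+, where `LeafRunnableCommand` is available; `allNodes` is an illustrative helper name.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.command.LeafRunnableCommand

// Hypothetical leaf command that wraps a plan it does not want optimized.
case class WrappingCommand(wrapped: LogicalPlan) extends LeafRunnableCommand {
  // Visible to EXPLAIN and opt-in walkers, not to transform / mapChildren.
  override def innerChildren: Seq[LogicalPlan] = Seq(wrapped)
  override def run(spark: SparkSession): Seq[Row] = Seq.empty
}

// Opt-in walker: visits a node, then its children and its inner children.
def allNodes(plan: LogicalPlan): Seq[LogicalPlan] = {
  val hidden = plan.innerChildren.collect { case p: LogicalPlan => p }
  plan +: (plan.children ++ hidden).flatMap(allNodes)
}
```

A walker that only follows `children` stops at the command, while `allNodes` also reaches the wrapped plan.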
   
   #### `HoodieIncrementalRelationIdentifier`
   
   The rule runs as a `customPostHocResolutionRule`, so it sees the resolved 
plan after Hudi's own analyzer rules have built the relation. It only fires 
when **all** of the following hold:
   
     - The node is a `LogicalRelation` over `HadoopFsRelation`
     - `catalogTable` is `None` (catalog-registered reads are left alone)
     - The location is `HoodieIncrementalFileIndex` or `HoodieCDCFileIndex`
   
   When it matches, it pulls the table name, base path, and database name from 
the `HoodieTableMetaClient` already carried by the file index (no 
extra metadata / FS calls), and synthesizes a `CatalogTable` from that plus the 
relation's resolved schema. Database falls back to Spark's `default` when 
`hoodie.database.name` is unset, matching existing path-based read behavior.
   
     **Why scope is limited to incremental and CDC:**
     - Catalog reads already populate `catalogTable`.
     - Path-based snapshot reads have a working file-path-based fallback in 
existing lineage extractors. Whether to enrich them too is a separate decision 
(easy to extend later).
   
   The transform is wrapped in 
`AnalysisHelper.allowInvokingTransformsInAnalyzer { ... }`, matching existing 
convention in `HoodieAnalysis.scala` (e.g. 
`AdaptIngestionTargetLogicalRelations`).
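
The shape of such a rule can be sketched with Spark APIs alone. Everything Hudi-specific is abstracted away here: `HudiTableInfo`, `StampPathBasedRelations`, and the `describe` hook are hypothetical names standing in for the lookup against the file index and its `HoodieTableMetaClient`, and the `EXTERNAL` table type is an assumption rather than something this PR states.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.catalyst.plans.logical.{AnalysisHelper, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.{FileIndex, HadoopFsRelation, LogicalRelation}
import org.apache.spark.sql.types.StructType

// Hypothetical carrier for the values read from the table's meta client.
case class HudiTableInfo(tableName: String, database: Option[String], basePath: String)

object StampPathBasedRelations {
  // Builds the synthesized catalog entry; database falls back to "default".
  def synthesize(info: HudiTableInfo, schema: StructType): CatalogTable =
    CatalogTable(
      identifier = TableIdentifier(info.tableName, Some(info.database.getOrElse("default"))),
      tableType = CatalogTableType.EXTERNAL, // assumption, not confirmed by the PR
      storage = CatalogStorageFormat.empty.copy(locationUri = Some(new Path(info.basePath).toUri)),
      schema = schema,
      provider = Some("hudi"))
}

// `describe` is a hypothetical hook that returns Some(info) only for
// incremental / CDC file indexes and None for everything else.
class StampPathBasedRelations(describe: FileIndex => Option[HudiTableInfo])
    extends Rule[LogicalPlan] {

  override def apply(plan: LogicalPlan): LogicalPlan =
    AnalysisHelper.allowInvokingTransformsInAnalyzer {
      plan.transformDown {
        // Catalog-registered relations already have an identity: skip them.
        case lr: LogicalRelation if lr.catalogTable.isEmpty =>
          lr.relation match {
            case fs: HadoopFsRelation =>
              describe(fs.location)
                .map(info => lr.copy(catalogTable =
                  Some(StampPathBasedRelations.synthesize(info, lr.schema))))
                .getOrElse(lr)
            case _ => lr
          }
      }
    }
}
```

A standalone rule like this could be registered through `SparkSessionExtensions.injectPostHocResolutionRule`; the PR itself wires the real rule into Hudi's own post-hoc rule list.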
   
   ### Risk Level
   
   Low.
   
   - `innerChildren` is consumed only by `EXPLAIN` rendering and opt-in plan 
walkers; nothing in Hudi's write path or Spark's optimizer traverses it.
   - The analyzer rule is a no-op when `catalogTable` is already set, so 
catalog-registered tables are unaffected.
   - No write path, no public API surface, no config changes.
   
   ### Test plan
   
   Two new Spark integration test classes under 
`hudi-spark-datasource/hudi-spark/src/test/scala/.../analysis/`:
   
   `TestMergeIntoHoodieTableCommandInnerChildren`:
   
     - `testMergeIntoExposesAnalyzedMergeIntoTableViaInnerChildren`: asserts 
`cmd.children.isEmpty` (leaf contract preserved) **and** `cmd.innerChildren` 
contains the analyzed `MergeIntoTable` with reachable source/target leaves, 
non-null merge condition, and the expected matched / not-matched actions.
     - `testExplainShowsSourceTableViaInnerChildren`: asserts the source view 
name shows up in `EXPLAIN` output, proving the round-trip through Spark's 
renderer works.
   
   `TestHoodieIncrementalRelationIdentifier`:
   
     - `testPathBasedIncrementalReadGetsCatalogTable` (parameterized over `cow` 
/ `mor`): asserts the rule populates `catalogTable` with the Hudi table name, 
base-path URI, schema, and `provider = "hudi"`.
     - `testCatalogRegisteredIncrementalReadIsNotMutated`: asserts that 
catalog-registered tables keep their original `catalogTable` (rule doesn't 
over-fire).
     - `testSnapshotPathBasedReadIsNotEnriched`: asserts snapshot path reads 
remain untouched (scope is intentional).
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
