alberto-tamez opened a new issue, #15765: URL: https://github.com/apache/datafusion/issues/15765
### Describe the bug When DataFusion produces a Substrait plan for a SQL query that performs an inner join followed by a projection (ProjectRel), the field indices for the right‑side columns are wrong. In our case: We expect other_int (the 2nd column of the right input) to be referenced by index 7 We expect other_flt (the 3rd column of the right input) to be referenced by index 8 —but the generated plan uses indices 5 and 6, which map instead to pred_input2 and key. Any Substrait consumer reading this plan will extract incorrect data. ### To Reproduce Prepare your Parquet tables at resources/dummy_table1.parquet and resources/dummy_table2.parquet (schema shown below). Save the following as validate_substrait.py: #!/usr/bin/env python3 import json from datafusion import SessionContext from datafusion.substrait import Serde from substrait.proto import Plan as SubstraitProtoPlan from google.protobuf.json_format import MessageToDict T1_PARQUET = "resources/dummy_table1.parquet" T2_PARQUET = "resources/dummy_table2.parquet" def find_project(node): if isinstance(node, dict): if "project" in node: return node["project"] for v in node.values(): p = find_project(v) if p is not None: return p elif isinstance(node, list): for item in node: p = find_project(item) if p is not None: return p return None def find_field(node): if isinstance(node, dict): for k, v in node.items(): if k == "field" and isinstance(v, int): return v deeper = find_field(v) if deeper is not None: return deeper elif isinstance(node, list): for item in node: deeper = find_field(item) if deeper is not None: return deeper return None def main(): ctx = SessionContext() ctx.register_parquet("t1", T1_PARQUET) ctx.register_parquet("t2", T2_PARQUET) sql = """ SELECT t1.key, t1.filter_col_int, t1.filter_col_float, t2.filter_col_num, t2.value_col, (t1.predict_in1 + t2.value_col) AS project_col FROM t1 INNER JOIN t2 ON t1.key = t2.key WHERE t1.filter_col_int > 0 AND (t2.value_col IS NULL OR t2.value_col < 40) AND t1.filter_col_float IS NOT NULL AND t2.value_col IS NOT NULL """ sub_plan = Serde.serialize_to_plan(sql, ctx) proto_bytes = sub_plan.encode() p = SubstraitProtoPlan() p.ParseFromString(proto_bytes) plan = MessageToDict(p, preserving_proto_field_name=True) proj = find_project(plan) if proj is None: raise RuntimeError("No ProjectRel found!") print("\nSubstrait ProjectRel.expressions:") for i, expr in enumerate(proj.get("expressions", [])): if "selection" in expr: idx = find_field(expr["selection"]) or 0 print(f" Expr[{i}]: selection → index = {idx}") else: fn = expr.get("scalar_function", {}).get("function_reference", "?") print(f" Expr[{i}]: scalar_function → func_ref={fn}") print( "\n❗ You should see:\n" " Expr[3] = 5 (WRONG; expected 7 for filter_col_num)\n" " Expr[4] = 6 (WRONG; expected 8 for value_col)" ) if __name__ == "__main__": main() == Install dependencies and run: pip install datafusion python-substrait protobuf chmod +x validate_substrait.py ./validate_substrait.py **Observe output showing Expr[3] = 5 and Expr[4] = 6 instead of 7 and 8.** ### Expected behavior Expr[3] should be 7 (pointing at other_int) Expr[4] should be 8 (pointing at other_flt) Downstream consumers reconstruct the final schema as key, filter_col_int, filter_col_float, other_int, other_flt, project_col ### Additional context DataFusion version: v45.2.0 Relevant PR: #12495 added explicit output_mapping to ProjectRel, but it appears not to cover post‑join projections. Related issues: COUNT schema mismatch (#10873) Empty‐args in AggregateRel (#15344) Grouping expression misalignment (#14348) Substrait spec: https://substrait.io/ (see ProjectRel and output_mapping) This bug blocks any Substrait consumer from correctly reading joined projections without custom hacks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org