alberto-tamez opened a new issue, #15765:
URL: https://github.com/apache/datafusion/issues/15765

   ### Describe the bug
   
   When DataFusion produces a Substrait plan for a SQL query that performs an 
inner join followed by a projection (ProjectRel), the field indices for the 
right‑side columns are wrong. In our case:
   
   We expect other_int (the 2nd column of the right input) to be referenced by 
index 7
   
   We expect other_flt (the 3rd column of the right input) to be referenced by 
index 8
   
   —but the generated plan uses indices 5 and 6, which map instead to 
pred_input2 and key. Any Substrait consumer reading this plan will extract 
incorrect data.
   
   ### To Reproduce
   
   Prepare your Parquet tables at resources/dummy_table1.parquet and 
resources/dummy_table2.parquet (schema shown below).
   
   Save the following as validate_substrait.py:
   
   #!/usr/bin/env python3
   import json
   from datafusion import SessionContext
   from datafusion.substrait import Serde
   from substrait.proto import Plan as SubstraitProtoPlan
   from google.protobuf.json_format import MessageToDict
   
   T1_PARQUET = "resources/dummy_table1.parquet"
   T2_PARQUET = "resources/dummy_table2.parquet"
   
   def find_project(node):
       if isinstance(node, dict):
           if "project" in node:
               return node["project"]
           for v in node.values():
               p = find_project(v)
               if p is not None: return p
       elif isinstance(node, list):
           for item in node:
               p = find_project(item)
               if p is not None: return p
       return None
   
   def find_field(node):
       if isinstance(node, dict):
           for k, v in node.items():
               if k == "field" and isinstance(v, int):
                   return v
               deeper = find_field(v)
               if deeper is not None: return deeper
       elif isinstance(node, list):
           for item in node:
               deeper = find_field(item)
               if deeper is not None: return deeper
       return None
   
   def main():
       ctx = SessionContext()
       ctx.register_parquet("t1", T1_PARQUET)
       ctx.register_parquet("t2", T2_PARQUET)
   
       sql = """
       SELECT
         t1.key,
         t1.filter_col_int,
         t1.filter_col_float,
         t2.filter_col_num,
         t2.value_col,
         (t1.predict_in1 + t2.value_col) AS project_col
       FROM t1
       INNER JOIN t2 ON t1.key = t2.key
       WHERE
         t1.filter_col_int > 0
         AND (t2.value_col IS NULL OR t2.value_col < 40)
         AND t1.filter_col_float IS NOT NULL
         AND t2.value_col IS NOT NULL
       """
   
       sub_plan = Serde.serialize_to_plan(sql, ctx)
       proto_bytes = sub_plan.encode()
   
       p = SubstraitProtoPlan()
       p.ParseFromString(proto_bytes)
       plan = MessageToDict(p, preserving_proto_field_name=True)
   
       proj = find_project(plan)
       if proj is None:
           raise RuntimeError("No ProjectRel found!")
   
       print("\nSubstrait ProjectRel.expressions:")
       for i, expr in enumerate(proj.get("expressions", [])):
           if "selection" in expr:
               idx = find_field(expr["selection"]) or 0
               print(f"  Expr[{i}]: selection → index = {idx}")
           else:
               fn = expr.get("scalar_function", {}).get("function_reference", 
"?")
               print(f"  Expr[{i}]: scalar_function → func_ref={fn}")
   
       print(
         "\n❗ You should see:\n"
         "   Expr[3] = 5  (WRONG; expected 7 for filter_col_num)\n"
         "   Expr[4] = 6  (WRONG; expected 8 for value_col)"
       )
   
   if __name__ == "__main__":
       main()
   
   ==
   
   Install dependencies and run:
   
   pip install datafusion python-substrait protobuf
   chmod +x validate_substrait.py
   ./validate_substrait.py
   
   **Observe output showing Expr[3] = 5 and Expr[4] = 6 instead of 7 and 8.**
   
   
   
   ### Expected behavior
   
   Expr[3] should be 7 (pointing at other_int)
   
   Expr[4] should be 8 (pointing at other_flt)
   
   Downstream consumers reconstruct the final schema as
   key, filter_col_int, filter_col_float, other_int, other_flt, project_col
   
   
   
   ### Additional context
   
   DataFusion version: v45.2.0
   
   Relevant PR: #12495 added explicit output_mapping to ProjectRel, but it 
appears not to cover post‑join projections.
   
   Related issues:
   
   COUNT schema mismatch (#10873)
   
   Empty‐args in AggregateRel (#15344)
   
   Grouping expression misalignment (#14348)
   
   Substrait spec: https://substrait.io/ (see ProjectRel and output_mapping)
   
   This bug blocks any Substrait consumer from correctly reading joined 
projections without custom hacks.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to