Blizzara commented on PR #13888:
URL: https://github.com/apache/datafusion/pull/13888#issuecomment-2560214683

   I like the idea, more testing the better! We already have some Substrait TCP 
testing, but I think that's from "known Substrait" -> DF, so it only tests the 
consumer, while this would test the producer as well.
   
   I think roughly there should be three levels of "equality":
   1. a plan goes through roundtrip without crashing, but doesn't match 
equality (this doesn't necessarily mean the result is correct, but it shows 
that there aren't any totally unsupported LogicalPlan nodes)
   2.  "optimized" plan equality - calling optimize on both plans before 
comparing, like you do.
   3.  "unoptimized" plan equality - comparing the plans directly.
   
   I feel like (2) should be "easier" check than (3), though the tests you have 
here show that some plans would pass unoptimized but not optimized, which 
confuses me. 
   
   Overall, I think (3) is nice to have, but not something I'd spent too much 
effort for - it doesn't really bring that much benefits to users compared to 
(2). (2) is nice compared to (1) in that it guarantees a bit more correctness, 
but in many cases it's unfortunately tough to support even (2) (even in theory, 
roundtrip isn't lossless, since multiple DF plans may result in the same 
Substrait plan, and the other way around too).
   
   So I'd suggest splitting the TPCH tests into 4 categories:
   - those that pass (3)
   - those that pass (2)
   - those that pass (1) 
     - maybe eyeballed into a) those that look correct after roundtrip even if 
the plan doesn't fully match, and b) those that look incorrect, if any
   - those that fail to convert for the roundtrip
   
   That allows merging the tests, preventing regressions, and tracking needed 
improvements :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to