IvanVergiliev commented on issue #24068: [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion URL: https://github.com/apache/spark/pull/24068#issuecomment-475575330 @cloud-fan I pushed a new version that has the same complexity and behaviour but doesn't need the `FilterWithConjunctPushdown` class. This should help eliminate some of the complexity. While I agree that building an `ExpressionTree` directly might result in easier to follow code, I also think it's a much riskier change than this one. A couple of reasons: 1. Without being completely familiar with the Hive codebase, I can't be entirely confident that the transformations performed by the `SearchArgumentBuilder` are strictly for optimization purposes. There might be other parts of the codebase that depend on the fact that a `SearchArgument` will have certain properties - which are achieved by creating it from the builder. Using reflection to create the underlying representation directly is obviously not the intended purpose of this code and I'm not confident it will be correct. 2. This PR as it is now doesn't change any external behaviour at all. The resulting ORC `SearchArgument`s will be exactly the same as they were before the change. Using `ExpressionTree` with reflection instead will be a full rewrite, and especially combined with 1. above, it can generate `SearchArgument`s that are different from the current ones. So, even if the changes introduced because of reasons 1. and 2. above are both correct, they could still end up producing ORC predicates that are less efficient than the current ones because they haven't been transformed / optimized in the proper way by the builder, and thus introduce performance regressions. I'm okay taking a stab at the `ExpressionTree` change, but I'm curious to hear more about which parts of the current solution you find complicated. The way I think about it, this change contains the following at a high level: 1. Split the `check` and `build` phases of creating a `SearchArgument`. Even this change on its own is enough to fix the complexity, taking it from exponential to quadratic. And while it introduces some code duplication, I think it doesn't introduce additional complexity - I actually find it easier to understand the behaviour this way, with the checking and the building separated, and without non-obvious effects resulting in exponential complexity. 2. Memoize the results from `check`. The results from `check` don't change, so they don't need to be checked multiple times per node. After getting rid of the `FilterWithConjunctPushdown` class in the latest version, this is pretty much all that is left here. We can get rid of the memoization as well, but I don't think that's worth it - it's < 10 lines of obvious code, and it makes the complexity easier to analyze. Let me know what you think about the newest version and my thoughts above, and thanks for taking a look!
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
