IvanVergiliev commented on issue #24068: [SPARK-27105][SQL] Optimize away 
exponential complexity in ORC predicate conversion
URL: https://github.com/apache/spark/pull/24068#issuecomment-475575330
 
 
   @cloud-fan I pushed a new version that has the same complexity and behaviour 
but doesn't need the `FilterWithConjunctPushdown` class. This should help 
eliminate some of the complexity.
   
   While I agree that building an `ExpressionTree` directly might result in 
easier to follow code, I also think it's a much riskier change than this one. A 
couple of reasons:
   1. Without being completely familiar with the Hive codebase, I can't be 
entirely confident that the transformations performed by the 
`SearchArgumentBuilder` are strictly for optimization purposes. There might be 
other parts of the codebase that depend on the fact that a `SearchArgument` 
will have certain properties - which are achieved by creating it from the 
builder. Using reflection to create the underlying representation directly is 
obviously not the intended purpose of this code and I'm not confident it will 
be correct.
   2. This PR as it is now doesn't change any external behaviour at all. The 
resulting ORC `SearchArgument`s will be exactly the same as they were before 
the change. Using `ExpressionTree` with reflection instead will be a full 
rewrite, and especially combined with 1. above, it can generate 
`SearchArgument`s that are different from the current ones.
   
   So, even if the changes introduced because of reasons 1. and 2. above are 
both correct, they could still end up producing ORC predicates that are less 
efficient than the current ones because they haven't been transformed / 
optimized in the proper way by the builder, and thus introduce performance 
regressions.
   
   I'm okay taking a stab at the `ExpressionTree` change, but I'm curious to 
hear more about which parts of the current solution you find complicated. The 
way I think about it, this change contains the following at a high level:
   
   1. Split the `check` and `build` phases of creating a `SearchArgument`. Even 
this change on its own is enough to fix the complexity, taking it from 
exponential to quadratic. And while it introduces some code duplication, I 
think it doesn't introduce additional complexity - I actually find it easier to 
understand the behaviour this way, with the checking and the building 
separated, and without non-obvious effects resulting in exponential complexity.
   
   2. Memoize the results from `check`. The results from `check` don't change, 
so they don't need to be checked multiple times per node.
   
   After getting rid of the `FilterWithConjunctPushdown` class in the latest 
version, this is pretty much all that is left here. We can get rid of the 
memoization as well, but I don't think that's worth it - it's < 10 lines of 
obvious code, and it makes the complexity easier to analyze.
   
   Let me know what you think about the newest version and my thoughts above, 
and thanks for taking a look!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to