[GitHub] [spark] IvanVergiliev commented on issue #24068: [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion

GitBox Fri, 07 Jun 2019 10:08:47 -0700

IvanVergiliev commented on issue #24068: [SPARK-27105][SQL] Optimize away 
exponential complexity in ORC predicate conversion
URL: https://github.com/apache/spark/pull/24068#issuecomment-499964362
 
 
   @gatorsmile yep, the performance is pretty much the same because the two 
implementations are exactly the same at a high level. The two main high-level 
steps are:
   1. Trim the Filter tree so that unconvertible filters are removed and we 
don't need to check before building ORC filters in the next step.
   2. Convert the trimmed filter, knowing that it's fully convertible at this 
point.
   
   The main differences are in code structure. At a high level, the two main 
differences are:
   1. Whether the trimming and conversion logic are in the same function, or in 
two separate functions. I actually had this implemented with two separate 
functions a few commits back - you can see 
https://github.com/IvanVergiliev/spark/pull/2/files for when I made the change 
to merge them in the same function. Notice that it's in a separate PR because I 
realize it's a controversial change. However, as mentioned back then, I think 
keeping the filtering and building logic in the same method has the following 
benefits:
   
   > • All the logic for a given predicate is grouped logically in the same 
place. You don't have to scroll across the whole file to see what the filter 
action for an And is while you're looking at the build action.
   > • You can't really add a new predicate to the set of filtered predicates 
without also defining a Build action for it - this fails the exhaustiveness 
check on ActionType.
   
   @cloud-fan agreed that this does sound like an improvement in 
https://github.com/apache/spark/pull/24068#issuecomment-494388288 , so I merged 
the change into the main PR. I still think this has code design benefits over 
the separate functions, but I'm open to reverting to the previous version if 
there's consensus that it's better.
   
   2. Whether we call `buildSearchArgument` while trimming the tree. While this 
provides for a shorter implementation, I think it makes the contract of 
`buildSearchArgument` very weak. If the trimming and building are independent 
from each other, we have a clear, well-defined contract for 
`buildSearchArgument` that's basically this
   > We always pass a fully convertible Filter, so buildSearchArgument can just 
depend on that and return a Builder.
   
   Instead, if we `buildSearchArgument` in order to trim the filters sometimes 
as well, we get a way murkier, unclear contract that goes something like:
   > We mostly pass a convertible Filter, but sometimes we actually pass 
whatever we get, and depend on `buildSearchArgument` to figure out if it's 
convertible or not.
   
   This leads to all kinds of questions:
   
     - if we always trim the Filter first, why do we need to return `Option` 
when building?
     - if I want to use the `buildSearchArgument` method and I see that it 
returns an Option, I might just as well assume that it will handle 
non-convertible filters just fine. However, this is only partially true - it 
can handle some of them fine, and can break for others.
     - There's no clear single responsibility of the `buildSearchArgument` 
method. The name implies that the only responsibility of the method is to build 
things, but in fact it's also sometimes used for trimming.
   
       So while this saves some code, I think it makes the overall design and 
separation of concerns much less clear.
   
   tl;dr I've gone through similar passes as the one proposed by 
@gengliangwang, but I've had specific design reasons to move away from them.
   
   I'm happy to go back to a version where the filtering and building are 
written in separate methods if people like this better. I'm more sceptical 
about the reuse of `buildSearchArgument` for the reasons I explained above, but 
I can also make this happen if people think it's a better approach.
   
   This got pretty long, but it also summarizes ~50 commits of changes and the 
discussion surrounding them. Let me know if anything is unclear or doesn't make 
sense.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] IvanVergiliev commented on issue #24068: [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion

Reply via email to