IvanVergiliev commented on issue #24068: [SPARK-27105][SQL] Optimize away exponential complexity in ORC predicate conversion URL: https://github.com/apache/spark/pull/24068#issuecomment-499964362 @gatorsmile yep, the performance is pretty much the same because the two implementations are exactly the same at a high level. The two main high-level steps are: 1. Trim the Filter tree so that unconvertible filters are removed and we don't need to check before building ORC filters in the next step. 2. Convert the trimmed filter, knowing that it's fully convertible at this point. The main differences are in code structure. At a high level, the two main differences are: 1. Whether the trimming and conversion logic are in the same function, or in two separate functions. I actually had this implemented with two separate functions a few commits back - you can see https://github.com/IvanVergiliev/spark/pull/2/files for when I made the change to merge them in the same function. Notice that it's in a separate PR because I realize it's a controversial change. However, as mentioned back then, I think keeping the filtering and building logic in the same method has the following benefits: > • All the logic for a given predicate is grouped logically in the same place. You don't have to scroll across the whole file to see what the filter action for an And is while you're looking at the build action. > • You can't really add a new predicate to the set of filtered predicates without also defining a Build action for it - this fails the exhaustiveness check on ActionType. @cloud-fan agreed that this does sound like an improvement in https://github.com/apache/spark/pull/24068#issuecomment-494388288 , so I merged the change into the main PR. I still think this has code design benefits over the separate functions, but I'm open to reverting to the previous version if there's consensus that it's better. 2. Whether we call `buildSearchArgument` while trimming the tree. While this provides for a shorter implementation, I think it makes the contract of `buildSearchArgument` very weak. If the trimming and building are independent from each other, we have a clear, well-defined contract for `buildSearchArgument` that's basically this > We always pass a fully convertible Filter, so buildSearchArgument can just depend on that and return a Builder. Instead, if we `buildSearchArgument` in order to trim the filters sometimes as well, we get a way murkier, unclear contract that goes something like: > We mostly pass a convertible Filter, but sometimes we actually pass whatever we get, and depend on `buildSearchArgument` to figure out if it's convertible or not. This leads to all kinds of questions: - if we always trim the Filter first, why do we need to return `Option` when building? - if I want to use the `buildSearchArgument` method and I see that it returns an Option, I might just as well assume that it will handle non-convertible filters just fine. However, this is only partially true - it can handle some of them fine, and can break for others. - There's no clear single responsibility of the `buildSearchArgument` method. The name implies that the only responsibility of the method is to build things, but in fact it's also sometimes used for trimming. So while this saves some code, I think it makes the overall design and separation of concerns much less clear. tl;dr I've gone through similar passes as the one proposed by @gengliangwang, but I've had specific design reasons to move away from them. I'm happy to go back to a version where the filtering and building are written in separate methods if people like this better. I'm more sceptical about the reuse of `buildSearchArgument` for the reasons I explained above, but I can also make this happen if people think it's a better approach. This got pretty long, but it also summarizes ~50 commits of changes and the discussion surrounding them. Let me know if anything is unclear or doesn't make sense.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
