Github user viirya commented on the pull request:
https://github.com/apache/spark/pull/8922#issuecomment-146215917
This implementation only considers the use case of evaluating a single
attribute with a UDF and comparing the result with a literal value. We restrict
it to this case because in the current implementation of `selectFilters` in
`DataSourceStrategy`, only predicates involving an attribute and a literal
value (e.g., `col = 1`, `col2 > 2`) are selected as candidates for
pushdown. Besides, the form `udf(column) = 'ABCDE...'` is very
common and widely used in our SQL queries that apply UDFs in filter conditions.
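To illustrate the restriction described above, here is a minimal sketch in plain Scala. The case classes and the `isPushable` helper are hypothetical simplifications, not Spark's actual expression classes or the real `selectFilters` code; they only model why an `attribute <op> literal` predicate qualifies for pushdown while a `udf(column) = literal` predicate does not under the stock rule.

```scala
// Hypothetical, simplified model of the expression shapes discussed above.
// These are NOT Spark's real classes; they only mimic the pattern match
// that selects pushdown candidates.
sealed trait Expression
case class Attribute(name: String) extends Expression
case class Literal(value: Any) extends Expression
case class ScalaUDF(child: Expression) extends Expression
case class EqualTo(left: Expression, right: Expression) extends Expression
case class GreaterThan(left: Expression, right: Expression) extends Expression

// A selectFilters-like match: only attribute-vs-literal comparisons become
// pushdown candidates; anything else (e.g., a UDF wrapping the attribute)
// falls through and stays as a Spark-side filter.
def isPushable(predicate: Expression): Boolean = predicate match {
  case EqualTo(Attribute(_), Literal(_))     => true
  case EqualTo(Literal(_), Attribute(_))     => true
  case GreaterThan(Attribute(_), Literal(_)) => true
  case GreaterThan(Literal(_), Attribute(_)) => true
  case _                                     => false
}

// `col = 1` matches; `udf(col) = 'ABCDE'` does not under the stock rule.
println(isPushable(EqualTo(Attribute("col"), Literal(1))))                 // true
println(isPushable(EqualTo(ScalaUDF(Attribute("col")), Literal("ABCDE")))) // false
```

The PR under discussion effectively adds a case for the UDF-wrapped form so it can also be considered for pushdown.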
Your proposal looks good and very general. However, I am a little worried
about the performance regression brought by creating a row for each input
value and evaluating the UDF on that row.
This patch helps us reduce the memory footprint required for loading a lot of
data from Parquet files. The performance improvement is not significant, but
it is at least competitive with the case of not pushing down.
I agree that this patch introduces additional complexity to the API. If you
still think it is not worth it, I will close this PR.
Thanks for the review and suggestions.