GitHub user viirya commented on the pull request:
https://github.com/apache/spark/pull/8922#issuecomment-145717868
Hmm, because the SQL query and data schema are sensitive for our company's
business, I may not be able to post them publicly here. The data size ranges
from hundreds of GB to 1 TB, and the SQL query roughly selects a dozen columns
from the table, with a few filters involving UDFs, a lateral view, and a group by.
I just realized that we don't need to apply `CatalystTypeConverters` to the UDF
input here. I think removing it should reduce some boxing time. As for the
degenerate cases: if you are asking whether I tested on data that mostly
doesn't satisfy the filter, the answer is no. For such cases, however, the
filtering will definitely add some computation overhead, with or without
this patch.
Your suggestion is correct. But in our case, the existing UDFs do not always
have signatures such as `Any => Boolean` or `Int => Boolean`, so our filtering
conditions look like `where udf(column1) = 'ABCDE...'`. That is why I need to
widen the API and use a more general signature here. I think it should be able
to handle the pushdown usage of UDFs. With the single Function filter or the
specialized variants you suggested, these UDFs would need to be modified
before they could be used.
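
To illustrate the difference, here is a minimal sketch in plain Python rather
than against the actual Spark API. All names in it (`filter_with_predicate`,
`filter_with_udf_equals`, `normalize`) are hypothetical and do not appear in
the patch. It only shows why a predicate-only filter forces a non-boolean UDF
to be wrapped, while a more general "UDF plus comparison value" filter can take
the UDF unchanged.

```python
from typing import Any, Callable, Iterable, List


def filter_with_predicate(values: Iterable[Any],
                          predicate: Callable[[Any], bool]) -> List[Any]:
    """Hypothetical predicate-only filter, analogous to `Any => Boolean`."""
    return [v for v in values if predicate(v)]


def filter_with_udf_equals(values: Iterable[Any],
                           udf: Callable[[Any], Any],
                           expected: Any) -> List[Any]:
    """Hypothetical general filter: keep rows where udf(value) == expected."""
    return [v for v in values if udf(v) == expected]


def normalize(s: str) -> str:
    """Stand-in for an existing UDF that returns a string, not a boolean."""
    return s.strip().upper()


rows = [" abcde ", "xyz", "ABCDE"]

# General form: the existing UDF is passed as-is.
general = filter_with_udf_equals(rows, normalize, "ABCDE")

# Predicate-only form: the same UDF has to be wrapped in a new boolean function.
wrapped = filter_with_predicate(rows, lambda v: normalize(v) == "ABCDE")
```

Both calls keep the same rows. The only difference is whether the caller has
to write a wrapper around each existing UDF.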
The above is the reason why I designed the API as it is in this patch. If you
still think this API is too general, I can update it as you suggest.