Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/8922#issuecomment-145717868
  
    Hmm, because the SQL query and data schema are business-sensitive for my 
company, I may not be able to post them publicly here. The data size is 
hundreds of GB to 1 TB, and the SQL query roughly selects a dozen columns from 
the table, with a few filters involving UDFs, a lateral view, and a group by.
    
    I just realized that we don't need to apply `CatalystTypeConverters` to the 
UDF input here. I think removing it should reduce some boxing time. As for the 
degenerate cases: if you are asking whether I tested on data that mostly does 
not satisfy the filter, the answer is no. However, in such cases the filtering 
will definitely add some computation overhead, with or without this patch.
    
    Your suggestion is correct. But in our case, the existing UDFs do not 
always have signatures such as `Any => Boolean` or `Int => Boolean`; our 
filter conditions look like `where udf(column1) = 'ABCDE...'`. That is why I 
need to widen the API and use a more general signature here. I think it should 
be able to handle the pushdown of UDFs. With the single Function filter or the 
specialized variants you suggested, these UDFs would have to be modified 
before they could be used with it.
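
    To illustrate the difference between the two signatures, here is a minimal 
sketch. All names in it (`UdfFilterSketch`, `UdfEquals`, `normalize`) are 
hypothetical and are not taken from this patch or from Spark; it only models 
the idea with plain Scala functions.

```scala
// Hypothetical sketch: none of these names come from the patch or from Spark.
object UdfFilterSketch {
  // Narrow form: the UDF itself must already be a predicate.
  type PredicateUdf = Any => Boolean

  // General form: the UDF may return any value, and the comparison
  // (here, equality with a literal) stays part of the filter.
  final case class UdfEquals(udf: Any => Any, literal: Any) {
    def eval(value: Any): Boolean = udf(value) == literal
  }

  // Stand-in for an existing UDF that returns a String, not a Boolean.
  val normalize: Any => Any = v => v.toString.trim.toUpperCase

  // `where normalize(column1) = 'ABCDE'` expressed with the general form:
  // the existing UDF is reused unchanged.
  val general: UdfEquals = UdfEquals(normalize, "ABCDE")

  // With the narrow form, the same UDF has to be wrapped or rewritten
  // so that the comparison lives inside the function.
  val narrow: PredicateUdf = v => normalize(v) == "ABCDE"

  def main(args: Array[String]): Unit = {
    val column1 = Seq(" abcde ", "xyz", "ABCDE")
    println(column1.filter(general.eval))
    println(column1.filter(narrow))
  }
}
```

    Both filters select the same rows; the only difference is whether the 
comparison is kept outside the UDF or has to be folded into it.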
    
    That is why I designed the API the way it is in this patch. If you still 
think this API is too general, I can update it as you suggest.
    


