[ https://issues.apache.org/jira/browse/HIVE-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-1638:
------------------------------

    Attachment: HIVE-1638.1.patch

Write GenericUDF functions for the logical operators (and, or, not) and the comparison operators (equal, not equal, greater, less, not greater, not less). Remove the respective UDFs. Make the other code changes needed to use the new functions.

I ran some sample queries and didn't find a performance regression in any of them. I then measured the improvement on some typical queries whose performance this change is expected to improve (basically queries with filters, especially string comparisons).

The sample queries were executed against `source_table`, which has the same data set as a production table.

`source_table` has 422 files with a total size of 127,881,234,652 bytes, compressed using RCFormat. It has 18 non-partition columns; ds is the partition column. Partition ds='2010-09-23' has about 5600M rows.

In most rows the value of column `group` is "wizard_generate_new", so `group`="wizard_generate_new" is not very selective.
f_c is a column whose values are widely spread. '5015', '4960', '2100', '2144' and '1451' are some values that have thousands of rows each, so f_c='xx' is very selective.

The split size was set so that 87 mappers were used for all the queries.

query1:
select count(1) from source_table where f_c='5015' and `group`='wizard_generate_new' and ds='2010-09-23'

query2:
select count(1) from source_table where ds='2010-09-23' and `group`='wizard_generate_new' and f_c='5015'

query3:
select f_c, count(1) from source_table where ds='2010-09-23' and (f_c='5015' or f_c='4960' or f_c='2100' or f_c='2144' or f_c='1451') and `group`='wizard_generate_new' group by f_c

query4:
insert overwrite table temp_result select * from source_table where (f_c='5015' or f_c='4960' or f_c='2100' or f_c='2144' or f_c='1451') and `group`='wizard_generate_new' and ds='2010-09-23'

We measured CPU costs.
We compared the CPU cycles reported by the MapReduce framework with the CPU time reported by the hmon service:

          Map CPU Cycle (MapRed Framework)              Total CPU Time (hmon)
          Old CPU Cycle  New CPU Cycle  Improvement     Old CPU Time  New CPU Time  Improvement
Query 1      12,052,635      6,987,915        42.0%           45,875        23,022        49.8%
Query 2      12,164,920     10,678,800        12.2%           46,759        42,186         9.8%
Query 3      27,258,930     21,609,840        20.7%          116,113        93,484        19.5%
Query 4      30,604,180     20,912,570        31.7%          115,883        79,492        31.4%

(Improvement is the reduction relative to the old value: (old - new) / old.)

> convert commonly used udfs to generic udfs
> ------------------------------------------
>
>                 Key: HIVE-1638
>                 URL: https://issues.apache.org/jira/browse/HIVE-1638
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Siying Dong
>         Attachments: HIVE-1638.1.patch
>
>
> Copying a mail from Joy:
> i did a little bit of profiling of a simple hive group by query today. i was
> surprised to see that one of the most expensive functions was in converting
> the equals udf (i had some simple string filters) to generic udfs.
> (primitiveobjectinspectorconverter.textconverter)
> am i correct in thinking that the fix is to simply port some of the most
> popular udfs (string equality/comparison etc.) to generic udfs?
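
The conversion cost described in the mail quoted above can be illustrated with a small, self-contained Java sketch. It is not Hive code and does not use Hive's classes: the class name StringEqualitySketch and both method names are hypothetical, and the byte arrays merely stand in for string column values as they might be stored. The first method converts both operands to an intermediate type before comparing, which is roughly the kind of per-row work a converter adds; the second compares the stored representation directly, which is the idea behind moving the comparison into a generic UDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: a simplified stand-in, not Hive's UDF or GenericUDF API.
public class StringEqualitySketch {

    // Conversion path (assumed analogy): decode both operands into new
    // String objects on every call, then compare the decoded values.
    static boolean equalsViaConversion(byte[] left, byte[] right) {
        String l = new String(left, StandardCharsets.UTF_8);
        String r = new String(right, StandardCharsets.UTF_8);
        return l.equals(r);
    }

    // Direct path (assumed analogy): compare the raw bytes without
    // allocating intermediate objects. For valid UTF-8 input this gives
    // the same answer as the conversion path.
    static boolean equalsOnRawBytes(byte[] left, byte[] right) {
        return Arrays.equals(left, right);
    }

    public static void main(String[] args) {
        byte[] columnValue = "5015".getBytes(StandardCharsets.UTF_8);
        byte[] constant = "5015".getBytes(StandardCharsets.UTF_8);
        byte[] otherValue = "4960".getBytes(StandardCharsets.UTF_8);

        System.out.println(equalsViaConversion(columnValue, constant)); // true
        System.out.println(equalsOnRawBytes(columnValue, constant));    // true
        System.out.println(equalsOnRawBytes(otherValue, constant));     // false
    }
}
```

Both methods return the same results here; the difference is only in how much work is done per row, which matters when a filter is evaluated billions of times.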