[jira] Updated: (HIVE-1638) convert commonly used udfs to generic udfs

Siying Dong (JIRA) Wed, 29 Sep 2010 00:09:04 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Siying Dong updated HIVE-1638:
------------------------------

    Attachment: HIVE-1638.1.patch

Wrote GenericUDF functions for the logical operations and, or, not and for the 
comparison operations equal, not equal, greater, less, not greater, not less. 
Removed the respective UDFs and made the other code changes needed to use the 
new functions.

I ran some sample queries and didn't find a performance regression in any of 
them.

Then I measured the improvement on some typical queries whose performance this 
change is expected to improve (basically queries with filters, especially 
string comparisons).

Sample queries were executed against `source_table`, which has the same 
data set as a production table. `source_table` has 422 files with a total 
size of 127,881,234,652 bytes, compressed using RCFile. It has 18 
non-partition columns; ds is the partition column.

Partition ds='2010-09-23' has about 5600M rows. The value of column `group` in 
most rows is "wizard_generate_new" (so `group`="wizard_generate_new" is not 
very selective). f_c is a column whose values are widely spread. '5015', 
'4960', '2100', '2144' and '1451' are some values that have thousands of rows 
(so f_c='xx' is very selective). The split size was set so that 87 mappers 
were used for all the queries.

query1:
select count(1) from source_table where f_c='5015' and 
`group`='wizard_generate_new' and ds='2010-09-23'

query2:
select count(1) from source_table where ds='2010-09-23' and 
`group`='wizard_generate_new' and f_c='5015'

query3:
select f_c, count(1) from source_table where ds='2010-09-23' and (f_c='5015' or 
f_c='4960' or f_c='2100' or f_c='2144' or f_c='1451') and 
`group`='wizard_generate_new' group by f_c

query4:
insert overwrite table temp_result select * from source_table where (f_c='5015' 
or f_c='4960' or f_c='2100' or f_c='2144' or f_c='1451') and 
`group`='wizard_generate_new' and ds='2010-09-23'

We measured CPU cost by comparing the CPU cycles reported by the MapReduce 
framework and the CPU time reported by the hmon service. The "Improvement" 
columns give the reduction relative to the old value:

         Map CPU Cycles (MapReduce framework)    Total CPU Time (hmon)
         Old          New          Improvement   Old       New       Improvement
Query 1  12,052,635    6,987,915   42.0%          45,875    23,022   49.8%
Query 2  12,164,920   10,678,800   12.2%          46,759    42,186    9.8%
Query 3  27,258,930   21,609,840   20.7%         116,113    93,484   19.5%
Query 4  30,604,180   20,912,570   31.7%         115,883    79,492   31.4%
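
For reference, the percentage columns in the table above appear to be computed 
as (old - new) / old. A minimal Python sketch that reproduces them from the 
reported figures; the helper name `reduction_pct` and the dictionary names are 
illustrative only and are not part of the patch:

```python
# (old, new) pairs copied from the table above.
map_cpu_cycles = {
    "Query 1": (12_052_635, 6_987_915),
    "Query 2": (12_164_920, 10_678_800),
    "Query 3": (27_258_930, 21_609_840),
    "Query 4": (30_604_180, 20_912_570),
}
total_cpu_time = {
    "Query 1": (45_875, 23_022),
    "Query 2": (46_759, 42_186),
    "Query 3": (116_113, 93_484),
    "Query 4": (115_883, 79_492),
}


def reduction_pct(old, new):
    """Percentage by which `new` is lower than `old` (hypothetical helper)."""
    return (old - new) / old * 100.0


for query in map_cpu_cycles:
    cycles = reduction_pct(*map_cpu_cycles[query])
    cpu_time = reduction_pct(*total_cpu_time[query])
    print(f"{query}: map CPU cycles -{cycles:.1f}%, total CPU time -{cpu_time:.1f}%")
```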



> convert commonly used udfs to generic udfs
> ------------------------------------------
>
>                 Key: HIVE-1638
>                 URL: https://issues.apache.org/jira/browse/HIVE-1638
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Siying Dong
>         Attachments: HIVE-1638.1.patch
>
>
> Copying a mail from Joy:
> i did a little bit of profiling of a simple hive group by query today. i was 
> surprised to see that one of the most expensive functions was in converting 
> the equals udf (i had some simple string filters) to generic udfs. 
> (primitiveobjectinspectorconverter.textconverter)
> am i correct in thinking that the fix is to simply port some of the most 
> popular udfs (string equality/comparison etc.) to generic udfs?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
