[jira] [Comment Edited] (HIVE-14130) HCatalog improvement by reducing invocations of toLowerCase() for fieldNames, repeatedly using DefaultHCatRecord, and adding static fields in HCatRecordSerDe.java

Zhu Li (JIRA) Wed, 29 Jun 2016 14:16:33 -0700

    [ https://issues.apache.org/jira/browse/HIVE-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355791#comment-15355791 ]


Zhu Li edited comment on HIVE-14130 at 6/29/16 9:16 PM:
--------------------------------------------------------

If anyone familiar with HCatalog sees anything wrong with the idea described 
in this issue, please let me know.


was (Author: zhu li):
Please let me know if there is anything wrong with my idea in the issue.

> HCatalog improvement by reducing invocations of toLowerCase() for fieldNames, 
> repeatedly using DefaultHCatRecord, and adding static fields in  
> HCatRecordSerDe.java
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-14130
>                 URL: https://issues.apache.org/jira/browse/HIVE-14130
>             Project: Hive
>          Issue Type: Improvement
>          Components: HCatalog
>            Reporter: Zhu Li
>            Assignee: Zhu Li
>              Labels: patch, performance
>   Original Estimate: 216h
>  Remaining Estimate: 216h
>
> 1. In HCatalog, the lazy-deserialization code in HCatRecordReader.java calls 
> a method named getPosition(fieldName) to get the index of a field in a row. 
> Each call also invokes toLowerCase() on the String variable fieldName. This 
> is negligible when the data is small, but on large data sets the repeated 
> toLowerCase() calls for the same set of field names waste time. Storing the 
> indices for the column names in the HCatRecordReader class, or storing 
> lower-case field names in outputSchema, would improve efficiency. 
> 2. HCatRecordReader.java creates a new instance of DefaultHCatRecord for 
> every incoming row of data, which costs time. Adding a private 
> DefaultHCatRecord field to this class and reusing it for each new row would 
> reduce some of that overhead.
> 3. The method serializePrimitiveField in HCatRecordSerDe.java invokes 
> HCatContext.INSTANCE.getConf() repeatedly. According to JProfiler results, 
> this also causes some overhead. Adding a static boolean field in 
> HCatRecordSerDe.java that stores HCatContext.INSTANCE.getConf().isPresent(), 
> and a static Configuration field that stores the result of 
> HCatContext.INSTANCE.getConf(), would also reduce overhead.
> In my test on a cluster, these modifications saved about 80 seconds when 
> HCatalog was used to load a table of 1 billion rows * 40 columns with 
> various data types. 
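
For readers who want to see the three ideas side by side, here is a rough sketch in plain Java. It is not the patch attached to the issue and does not use the real HCatalog classes: SketchSchema, SketchRecord, SketchContext, and SketchReader are hypothetical stand-ins, and java.util.Optional is used only as a placeholder for whatever getConf() returns in HCatalog. The reader resolves field positions once (item 1), reuses one record for every row (item 2), and reads the configuration once into static fields (item 3).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// All classes here are hypothetical stand-ins, not the real HCatalog types.
class HCatSketchDemo {
    public static void main(String[] args) {
        SketchSchema schema = new SketchSchema(Arrays.asList("Id", "Name", "Age"));
        SketchReader reader = new SketchReader(schema, Arrays.asList("AGE", "id"));
        SketchRecord first = reader.project(new Object[] {1, "a", 30});
        SketchRecord second = reader.project(new Object[] {2, "b", 41});
        System.out.println("same record instance: " + (first == second));
        System.out.println("projected row: " + second.get(0) + ", " + second.get(1));
        System.out.println("configuration lookups: " + SketchContext.lookups);
    }
}

/** Stand-in for a schema whose getPosition() lower-cases its argument on every call. */
class SketchSchema {
    private final Map<String, Integer> positions = new HashMap<>();

    SketchSchema(List<String> fieldNames) {
        for (int i = 0; i < fieldNames.size(); i++) {
            positions.put(fieldNames.get(i).toLowerCase(Locale.ROOT), i);
        }
    }

    int getPosition(String fieldName) {
        Integer pos = positions.get(fieldName.toLowerCase(Locale.ROOT));
        if (pos == null) {
            throw new IllegalArgumentException("Unknown field: " + fieldName);
        }
        return pos;
    }
}

/** Stand-in for a mutable record such as DefaultHCatRecord. */
class SketchRecord {
    private final Object[] values;

    SketchRecord(int size) { values = new Object[size]; }
    void set(int index, Object value) { values[index] = value; }
    Object get(int index) { return values[index]; }
}

/** Stand-in for a context object whose configuration lookup should not be repeated. */
class SketchContext {
    static int lookups = 0; // counts how often getConf() runs

    static Optional<Map<String, String>> getConf() {
        lookups++;
        Map<String, String> conf = new HashMap<>();
        return Optional.of(conf);
    }
}

class SketchReader {
    // Item 3: look the configuration up once, when the class is initialized.
    private static final Optional<Map<String, String>> CONF = SketchContext.getConf();
    private static final boolean CONF_PRESENT = CONF.isPresent();

    private final int[] sourceIndex;   // item 1: positions resolved once per reader
    private final SketchRecord reused; // item 2: one record reused for every row

    SketchReader(SketchSchema dataSchema, List<String> outputFields) {
        sourceIndex = new int[outputFields.size()];
        for (int i = 0; i < sourceIndex.length; i++) {
            // The only place toLowerCase() runs: once per output field, not once per row.
            sourceIndex[i] = dataSchema.getPosition(outputFields.get(i));
        }
        reused = new SketchRecord(sourceIndex.length);
    }

    /** Copies the selected columns of a row into the reused record and returns it. */
    SketchRecord project(Object[] row) {
        for (int i = 0; i < sourceIndex.length; i++) {
            reused.set(i, row[sourceIndex[i]]);
        }
        return reused;
    }

    static boolean confPresent() { return CONF_PRESENT; }
}
```

Two trade-offs follow from this shape. A caller that keeps a reference to the record returned by project() will see it overwritten by the next row, so reuse is only safe if rows are consumed or copied before the next read. Values cached in static fields are fixed at class initialization, so a configuration set or changed after that point would not be picked up.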



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
