[
https://issues.apache.org/jira/browse/HIVE-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhu Li updated HIVE-14130:
--------------------------
Summary: HCatalog improvement by reducing invocations of toLowerCase() for
fieldNames and repeatedly using DefaultHCatRecord in HCatRecordReader, and
adding static fields in HCatRecordSerDe.java (was: HCatalog improvement by
reducing invocations of toLowerCase() for fieldNames, repeatedly using
DefaultHCatRecord, and adding static fields in HCatRecordSerDe.java)
> HCatalog improvement by reducing invocations of toLowerCase() for fieldNames
> and repeatedly using DefaultHCatRecord in HCatRecordReader, and adding static
> fields in HCatRecordSerDe.java
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-14130
> URL: https://issues.apache.org/jira/browse/HIVE-14130
> Project: Hive
> Issue Type: Improvement
> Components: HCatalog
> Reporter: Zhu Li
> Assignee: Zhu Li
> Labels: patch, performance
> Original Estimate: 216h
> Remaining Estimate: 216h
>
> 1. In HCatalog, the lazy-deserialization code in HCatRecordReader.java calls
> getPosition(fieldName) to get the index of a field in a row. Each call also
> invokes toLowerCase() on the String fieldName. The cost is negligible for
> small data sets, but for very large ones the repeated toLowerCase() calls on
> the same set of field names waste time. Storing the column indices in the
> HCatRecordReader class, or storing lower-case field names in outputSchema,
> avoids this repeated work.
> 2. HCatRecordReader.java creates a new DefaultHCatRecord instance for every
> incoming row, which wastes time. Adding a private DefaultHCatRecord field to
> the class and reusing it for each new row reduces this overhead.
> 3. The method serializePrimitiveField in HCatRecordSerDe.java calls
> HCatContext.INSTANCE.getConf() repeatedly, which JProfiler results show
> also adds overhead. Adding a static boolean field to HCatRecordSerDe.java
> that stores HCatContext.INSTANCE.getConf().isPresent(), and a static
> Configuration field that stores the result of HCatContext.INSTANCE.getConf(),
> reduces this overhead.
> In my test on a cluster, these modifications saved about 80 seconds when
> HCatalog loaded a table of 1 billion rows by 40 columns with various data
> types.
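
The three changes described in the issue can be illustrated with a small,
self-contained Java sketch. This is not the actual patch and uses none of the
real HCatalog classes: Schema, Record, Context, Reader, Serializer and the
"example.promote" key are hypothetical stand-ins, and java.util.Optional is
assumed in place of whatever Optional type getConf() really returns.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Illustrative sketch only; every type here is a hypothetical stand-in. */
final class ReaderSketch {

    /** Stand-in for a schema whose lookup lower-cases the name on every call. */
    static final class Schema {
        private final Map<String, Integer> positions = new HashMap<>();
        int lowerCaseCalls = 0; // counts per-lookup toLowerCase() invocations

        Schema(List<String> fieldNames) {
            for (int i = 0; i < fieldNames.size(); i++) {
                positions.put(fieldNames.get(i).toLowerCase(Locale.ROOT), i);
            }
        }

        Integer getPosition(String fieldName) {
            lowerCaseCalls++;
            return positions.get(fieldName.toLowerCase(Locale.ROOT));
        }
    }

    /** Stand-in for a mutable record that can be refilled for each row. */
    static final class Record {
        private final Object[] values;
        Record(int size) { values = new Object[size]; }
        void set(int i, Object v) { values[i] = v; }
        Object get(int i) { return values[i]; }
    }

    /** Stand-in for a context whose getConf() is assumed costly per call. */
    static final class Context {
        static int getConfCalls = 0;
        static Optional<Map<String, String>> getConf() {
            getConfCalls++;
            return Optional.of(Collections.singletonMap("example.promote", "true"));
        }
    }

    static final class Reader {
        private final int[] outputToDataPos; // item 1: indices resolved once
        private final Record reused;         // item 2: one record for all rows

        Reader(Schema dataSchema, List<String> outputFields) {
            outputToDataPos = new int[outputFields.size()];
            for (int i = 0; i < outputToDataPos.length; i++) {
                Integer pos = dataSchema.getPosition(outputFields.get(i));
                if (pos == null) {
                    throw new IllegalArgumentException("Unknown field: " + outputFields.get(i));
                }
                outputToDataPos[i] = pos;
            }
            reused = new Record(outputToDataPos.length);
        }

        /** Refills and returns the same Record; callers must not retain it across rows. */
        Record convert(Object[] row) {
            for (int i = 0; i < outputToDataPos.length; i++) {
                reused.set(i, row[outputToDataPos[i]]);
            }
            return reused;
        }
    }

    static final class Serializer {
        // Item 3: configuration resolved once, when the class is initialized.
        private static final Optional<Map<String, String>> CONF = Context.getConf();
        private static final boolean CONF_PRESENT = CONF.isPresent();

        static Object serializePrimitive(Object value) {
            // Hypothetical conversion: widen Short to Integer when the flag is set.
            if (CONF_PRESENT && value instanceof Short
                    && "true".equals(CONF.get().get("example.promote"))) {
                return ((Short) value).intValue();
            }
            return value;
        }
    }
}
```

Two trade-offs apply to a design like this. A reused record is overwritten by
the next row, so it is only safe when consumers copy what they need before
asking for another row. Values cached in static fields are fixed at class
initialization, so later configuration changes would not be seen.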