Zhu Li created HIVE-14131:
-----------------------------
Summary: Performance
Key: HIVE-14131
URL: https://issues.apache.org/jira/browse/HIVE-14131
Project: Hive
Issue Type: Improvement
Components: HCatalog
Reporter: Zhu Li
Assignee: Zhu Li
1. In HCatalog, the lazy-deserialization code in HCatRecordReader.java calls
getPosition(fieldName) to get the index of a field in a row. Each call also
invokes toLowerCase() on the String fieldName. The cost is negligible when
the data size is small, but when it is huge, the repeated toLowerCase() calls
for the same set of field names waste time. Storing the column indices in the
HCatRecordReader class, or storing lower-case field names in outputSchema,
would improve efficiency.
2. HCatRecordReader.java creates a new DefaultHCatRecord instance for every
incoming row of data, which wastes time. Adding a private DefaultHCatRecord
field to this class and reusing it for each new row would reduce some
overhead.
3. The method serializePrimitiveField in HCatRecordSerDe.java calls
HCatContext.INSTANCE.getConf() repeatedly. According to JProfiler results,
this also causes some overhead. Adding a static boolean field in
HCatRecordSerDe.java that stores HCatContext.INSTANCE.getConf().isPresent(),
and a static Configuration field that stores the result of
HCatContext.INSTANCE.getConf(), would also reduce overhead.
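Items 1 and 2 could be sketched roughly as follows. This is only an
illustration, not the actual HCatalog code: SketchSchema, SketchRecord and
SketchReader are hypothetical stand-ins for the schema, DefaultHCatRecord
and HCatRecordReader, and the row is assumed to arrive as a plain Object[].

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical schema whose lookup lower-cases the name on every call. */
class SketchSchema {
    private final Map<String, Integer> positions = new HashMap<>();

    SketchSchema(List<String> fieldNames) {
        for (int i = 0; i < fieldNames.size(); i++) {
            positions.put(fieldNames.get(i).toLowerCase(Locale.ROOT), i);
        }
    }

    /** Returns the field's index, or null if the schema has no such field. */
    Integer getPosition(String fieldName) {
        return positions.get(fieldName.toLowerCase(Locale.ROOT));
    }
}

/** Hypothetical stand-in for a mutable record such as DefaultHCatRecord. */
class SketchRecord {
    private final List<Object> fields;

    SketchRecord(int size) {
        fields = new ArrayList<>(Collections.<Object>nCopies(size, null));
    }

    void set(int index, Object value) { fields.set(index, value); }
    Object get(int index) { return fields.get(index); }
}

/** Resolves column positions once and reuses one record for all rows. */
class SketchReader {
    private final int[] sourceIndex;     // item 1: cached positions, -1 if absent
    private final SketchRecord current;  // item 2: single reusable record

    SketchReader(SketchSchema dataSchema, List<String> outputFieldNames) {
        sourceIndex = new int[outputFieldNames.size()];
        for (int i = 0; i < sourceIndex.length; i++) {
            // The only place getPosition (and toLowerCase) runs: once per column.
            Integer pos = dataSchema.getPosition(outputFieldNames.get(i));
            sourceIndex[i] = (pos == null) ? -1 : pos;
        }
        current = new SketchRecord(sourceIndex.length);
    }

    /** Fills the shared record from a raw row using only the cached indices. */
    SketchRecord next(Object[] rawRow) {
        for (int i = 0; i < sourceIndex.length; i++) {
            int src = sourceIndex[i];
            current.set(i, (src >= 0 && src < rawRow.length) ? rawRow[src] : null);
        }
        return current;
    }
}
```

One consequence of reusing the record: a caller that keeps a reference to a
row after the next call to next() will see it overwritten, so such callers
would need to copy the values first.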
In my test on a cluster, the above modifications saved about 80 seconds when
HCatalog was used to load a table of 1 billion rows * 40 columns with
various data types.
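For item 3, a minimal sketch of the caching idea, under stated assumptions:
SketchContext stands in for HCatContext, java.util.Properties stands in for
the Hadoop Configuration, and the key "sketch.flag" is made up for the
example. The lookup counter exists only to show how often getConf() runs.

```java
import java.util.Optional;
import java.util.Properties;

/** Hypothetical context singleton holding an optional configuration. */
enum SketchContext {
    INSTANCE;

    private Properties conf;
    private int lookups;

    synchronized void setConf(Properties newConf) { conf = newConf; }

    synchronized Optional<Properties> getConf() {
        lookups++;  // counts how many times the context is consulted
        return Optional.ofNullable(conf);
    }

    synchronized int lookupCount() { return lookups; }
}

/** Hypothetical SerDe that caches the context lookup in static fields. */
class SketchSerDe {
    // Captured once, when this class is first initialized.
    private static final boolean CONF_PRESENT;
    private static final Properties CONF;

    static {
        Optional<Properties> conf = SketchContext.INSTANCE.getConf();
        CONF_PRESENT = conf.isPresent();
        CONF = conf.orElse(null);
    }

    /** Per-field check that reads the cached values instead of calling getConf(). */
    static boolean flagEnabled() {
        return CONF_PRESENT
            && Boolean.parseBoolean(CONF.getProperty("sketch.flag", "false"));
    }
}
```

The trade-off is that static fields are filled in when the class is first
initialized, so a configuration installed in the context after that point
would not be seen by the cached fields.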