[
https://issues.apache.org/jira/browse/HIVE-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zheng Shao updated HIVE-11188:
------------------------------
Priority: Major (was: Minor)
> Make ORCFile's String Dictionary more efficient
> -----------------------------------------------
>
> Key: HIVE-11188
> URL: https://issues.apache.org/jira/browse/HIVE-11188
> Project: Hive
> Issue Type: Improvement
> Components: File Formats
> Affects Versions: 1.2.0, 1.1.0
> Reporter: Zheng Shao
>
> Currently, ORCFile's String Dictionary uses StringRedBlackTree for
> adding/finding/sorting duplicate strings. When there are a large number of
> unique strings (let's say over 16K) and a large number of rows (let's say
> 1M), the binary search will take O(1M * log(16K)) time which can be very long.
> Alternatively, ORCFile's String Dictionary can use HashMap for adding/finding
> duplicate strings, and use quicksort at the end to produce a sorted order.
> In the same case above, the total time spent will be O(1M + 16K * log(16K))
> which is much faster.
> When the number of unique string is close to the number of rows (let's say,
> both around 1M), ORC will automatically disable the dictionary encoding. In
> the old approach will take O(1M * log(1M)), and our new approach will take
> O(1M) since we can skip the final quicksort if the dictionary encoding is
> disabled.
> So in either case, the new approach should be a win.
> Here is an PMP output based on ~600 traces (so 126 means 126/600 ~= 21% of
> total time). It's a query like "INSERT OVERWRITE TABLE SELECT * FROM src"
> using hive-1.1.0-cdh-5.4.1.
> 126
> org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:67)
> 35 java.util.zip.Deflater.deflateBytes(Native Method)
> 26
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.findClosestNumBits(SerializationUtils.java:218)
> 24
> org.apache.hadoop.hive.serde2.lazy.LazyNonPrimitive.isNull(LazyNonPrimitive.java:63)
> 22 org.apache.hadoop.hive.serde2.lazy.LazyMap.parse(LazyMap.java:204)
> 22
> org.apache.hadoop.hive.serde2.lazy.LazyLong.parseLong(LazyLong.java:116)
> 21
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:111)
> 19
> org.apache.hadoop.hive.serde2.lazy.LazyPrimitive.hashCode(LazyPrimitive.java:57)
> 18
> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.getRight(RedBlackTree.java:99)
> 16
> org.apache.hadoop.hive.ql.io.RCFile$Reader.getCurrentRow(RCFile.java:1932)
> 15
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native
> Method)
> 15
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$IntegerTreeWriter.write(WriterImpl.java:929)
> 12
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:1607)
> 12 org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
> 11
> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.getLeft(RedBlackTree.java:92)
> 11
> org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.add(DynamicIntArray.java:105)
> 10 org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
> ...
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)