[jira] [Updated] (HIVE-11188) Make ORCFile's String Dictionary more efficient

Zheng Shao (JIRA) Mon, 06 Jul 2015 18:18:54 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zheng Shao updated HIVE-11188:
------------------------------
    Priority: Major  (was: Minor)

> Make ORCFile's String Dictionary more efficient
> -----------------------------------------------
>
>                 Key: HIVE-11188
>                 URL: https://issues.apache.org/jira/browse/HIVE-11188
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 1.2.0, 1.1.0
>            Reporter: Zheng Shao
>
> Currently, ORCFile's String Dictionary uses StringRedBlackTree for 
> adding/finding/sorting duplicate strings.  When there are a large number of 
> unique strings (let's say over 16K) and a large number of rows (let's say 
> 1M), the binary search will take O(1M * log(16K)) time which can be very long.
> Alternatively, ORCFile's String Dictionary can use HashMap for adding/finding 
> duplicate strings, and use quicksort at the end to produce a sorted order.  
> In the same case above, the total time spent will be O(1M + 16K * log(16K)) 
> which is much faster.
> When the number of unique string is close to the number of rows (let's say, 
> both around 1M), ORC will automatically disable the dictionary encoding.  In 
> the old approach will take O(1M * log(1M)), and our new approach will take 
> O(1M) since we can skip the final quicksort if the dictionary encoding is 
> disabled.
> So in either case, the new approach should be a win.
> Here is an PMP output based on ~600 traces (so 126 means 126/600 ~= 21% of 
> total time). It's a query like "INSERT OVERWRITE TABLE SELECT * FROM src" 
> using hive-1.1.0-cdh-5.4.1.
>     126  
> org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:67)
>      35  java.util.zip.Deflater.deflateBytes(Native Method)
>      26  
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.findClosestNumBits(SerializationUtils.java:218)
>      24  
> org.apache.hadoop.hive.serde2.lazy.LazyNonPrimitive.isNull(LazyNonPrimitive.java:63)
>      22  org.apache.hadoop.hive.serde2.lazy.LazyMap.parse(LazyMap.java:204)
>      22  
> org.apache.hadoop.hive.serde2.lazy.LazyLong.parseLong(LazyLong.java:116)
>      21  
> org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:111)
>      19  
> org.apache.hadoop.hive.serde2.lazy.LazyPrimitive.hashCode(LazyPrimitive.java:57)
>      18  
> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.getRight(RedBlackTree.java:99)
>      16  
> org.apache.hadoop.hive.ql.io.RCFile$Reader.getCurrentRow(RCFile.java:1932)
>      15  
> org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native 
> Method)
>      15  
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$IntegerTreeWriter.write(WriterImpl.java:929)
>      12  
> org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:1607)
>      12  org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
>      11  
> org.apache.hadoop.hive.ql.io.orc.RedBlackTree.getLeft(RedBlackTree.java:92)
>      11  
> org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.add(DynamicIntArray.java:105)
>      10  org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
>      ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (HIVE-11188) Make ORCFile's String Dictionary more efficient

Reply via email to