[ 
https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716015#action_12716015
 ] 

Ashish Thusoo commented on HIVE-537:
------------------------------------

One thing to be careful about: this approach increases the number of rows 
crossing the map/reduce boundary. If there are a lot of distincts, that can 
lead to data explosion and a subsequent slowdown in the sort.

By that I mean the following:

Suppose we have a query with m different distincts, a base table with N 
rows, p mappers, and r reducers.
By doing multiple map/reduce jobs, the dominant terms in our complexity are

O(mN/p) + O(m(N/p log (N/p))) + O(mN/r) + O(m)

i.e.
map-side scan + map-side sort + reduce-side merge + fixed cost of starting the 
map/reduce jobs.

Now, with the current approach, the corresponding formula will be

O(mN/p) + O(mN/p log (mN/p)) + O(mN/r)
=
O(mN/p) + O(mN/p log (N/p)) + O(mN/p log m) + O(mN/r)

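The two differ by the extra O(mN/p log m) sort term in the single-job case 
versus the O(m) job startup cost in the multi-job case, so there may be 
situations where one is better than the other. Something to keep in mind.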
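
To make the comparison concrete, here is a rough sketch that evaluates both 
expressions numerically. It assumes unit constants for every term and log 
base 2, neither of which the big-O formulas actually specify. The class 
name, method names, and the per-job startup cost parameter are hypothetical 
and only for illustration.

```java
/** Illustrative cost model only; all constants are assumed to be 1. */
final class DistinctCost {
    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** m separate jobs: scan + sort + merge per job, plus a startup cost per job. */
    static double multiJob(double m, double n, double p, double r, double startup) {
        double rowsPerMapper = n / p;
        return m * rowsPerMapper                          // map-side scan
             + m * rowsPerMapper * log2(rowsPerMapper)    // map-side sort
             + m * n / r                                  // reduce-side merge
             + m * startup;                               // fixed job startup (assumed value)
    }

    /** One job that emits m rows per input row, so each mapper sorts mN/p rows. */
    static double singleJob(double m, double n, double p, double r) {
        double rowsPerMapper = m * n / p;
        return rowsPerMapper                              // map-side scan
             + rowsPerMapper * log2(rowsPerMapper)        // map-side sort of the expanded rows
             + m * n / r;                                 // reduce-side merge
    }

    public static void main(String[] args) {
        // Made-up inputs: 4 distincts, 1e6 rows, 10 mappers, 5 reducers.
        double m = 4, n = 1e6, p = 10, r = 5;
        System.out.println("multi-job : " + multiJob(m, n, p, r, 1e5));
        System.out.println("single job: " + singleJob(m, n, p, r));
    }
}
```

Under these assumptions the single job costs (mN/p) log m more in the sort 
and saves m times the startup cost, so which one wins depends on how the 
startup cost compares with (N/p) log m.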


> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and 
> map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>
> There are already some cases in the code where we use heterogeneous data: 
> JoinOperator and UnionOperator (in the sense that different parents can pass 
> in records with different ObjectInspectors).
> We currently use the Operator's parentID to distinguish them. However, that 
> approach does not extend to the more complex plans that might be needed in 
> the future.
> We will support the union type like this:
> {code}
> TypeDefinition:
>   type: primitivetype | structtype | arraytype | maptype | uniontype
>   uniontype: "union" "<" tag ":" type ("," tag ":" type)* ">"
> Example:
>   union<0:int,1:double,2:array<string>,3:struct<a:int,b:string>>
> Example of serialized data format:
>   We will first store the tag byte before we serialize the object. On 
> deserialization, we will first read out the tag byte, then we know what is 
> the current type of the following object, so we can deserialize it 
> successfully.
> Interface for ObjectInspector:
> interface UnionObjectInspector {
>   /** Returns the array of OIs that are for each of the tags
>    */
>   ObjectInspector[] getObjectInspectors();
>   /** Return the tag of the object.
>    */
>   byte getTag(Object o);
>   /** Return the field based on the tag value associated with the Object.
>    */
>   Object getField(Object o);
> };
> {code}
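
The tag-byte format described in the issue could be sketched roughly as 
follows for a two-branch union<0:int,1:double>. This is not Hive code: the 
class name, constants, and the use of java.io data streams are assumptions 
made only to illustrate "write the tag byte first, then the object".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical codec for union<0:int,1:double>; illustration only. */
final class TaggedUnionCodec {
    static final byte TAG_INT = 0;
    static final byte TAG_DOUBLE = 1;

    /** Writes the tag byte first, then the value in that branch's encoding. */
    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (value instanceof Integer) {
                out.writeByte(TAG_INT);
                out.writeInt((Integer) value);
            } else if (value instanceof Double) {
                out.writeByte(TAG_DOUBLE);
                out.writeDouble((Double) value);
            } else {
                throw new IllegalArgumentException("Not a member of the union: " + value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the tag byte, then uses it to decide how to decode the rest. */
    static Object deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte tag = in.readByte();
            switch (tag) {
                case TAG_INT:
                    return in.readInt();
                case TAG_DOUBLE:
                    return in.readDouble();
                default:
                    throw new IOException("Unknown union tag: " + tag);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch an int value takes 5 bytes (1 tag byte plus 4 data bytes) 
and a double takes 9, and the reader never needs out-of-band information 
such as the parent operator's ID to know which branch it is looking at.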

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
