Hi Robert,

1. At a high level, you can refer to
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
where the different vertices, edges, etc. get created as per the execution plan.
Consider a vertex as a combination of input, processing logic and output.
Different vertices are connected by edges, which define the data movement
logic (broadcast, scatter-gather, one-to-one, etc.). In the edge
configuration, the type of the key/value classes is defined. This DAG is
submitted to Tez for execution.
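
To make that concrete, here is a minimal sketch of assembling such a DAG
directly against the Tez API (this is not the actual Hive code path in
DagUtils; the processor class names and key/value/partitioner choices below
are just placeholders for illustration):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class DagSketch {
  public static DAG buildDag() {
    // A vertex = input + processing logic + output; the processor class
    // carries the processing logic (class names here are placeholders).
    Vertex mapper = Vertex.create("mapper",
        ProcessorDescriptor.create("com.example.MapProcessor"), 2);
    Vertex reducer = Vertex.create("reducer",
        ProcessorDescriptor.create("com.example.ReduceProcessor"), 1);

    // The edge configuration is where the intermediate key/value classes
    // (and the partitioner) get declared.
    OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), LongWritable.class.getName(),
            HashPartitioner.class.getName())
        .build();

    // Scatter-gather edge connecting the two vertices.
    Edge edge = Edge.create(mapper, reducer, edgeConf.createDefaultEdgeProperty());

    return DAG.create("sketch-dag")
        .addVertex(mapper)
        .addVertex(reducer)
        .addEdge(edge);
  }
}

The returned DAG object is what gets handed over to Tez (via
TezClient.submitDAG) for execution.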

2. For task processing, you can refer to
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezProcessor.java
on the Hive side.
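
Just to show the general shape of a processor (a bare-bones sketch, not
Hive's TezProcessor, which extends AbstractLogicalIOProcessor and does far
more), the framework hands the task its inputs and outputs, and the task
reads/writes through reader/writer objects obtained from them:

import org.apache.tez.runtime.api.LogicalInput;
import org.apache.tez.runtime.api.LogicalOutput;
import org.apache.tez.runtime.api.ProcessorContext;
import org.apache.tez.runtime.library.api.KeyValueReader;
import org.apache.tez.runtime.library.api.KeyValueWriter;
import org.apache.tez.runtime.library.processor.SimpleProcessor;

public class PassThroughProcessor extends SimpleProcessor {

  public PassThroughProcessor(ProcessorContext context) {
    super(context);
  }

  @Override
  public void run() throws Exception {
    // Inputs/outputs are keyed by the name of the connected vertex; this
    // sketch assumes exactly one of each.
    LogicalInput input = getInputs().values().iterator().next();
    LogicalOutput output = getOutputs().values().iterator().next();

    KeyValueReader reader = (KeyValueReader) input.getReader();
    KeyValueWriter writer = (KeyValueWriter) output.getWriter();

    // This loop is the point you asked about: data is read from the input
    // and written to the output of the task.
    while (reader.next()) {
      writer.write(reader.getCurrentKey(), reader.getCurrentValue());
    }
  }
}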

3. On the Tez side, there are different types of inputs and outputs available
for reading/writing data, e.g. OrderedGroupedKVInput, UnorderedKVInput,
OrderedPartitionedKVOutput, UnorderedKVOutput, UnorderedPartitionedKVOutput,
etc. A sketch after the links below shows how an edge configuration pairs
them up.
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/OrderedGroupedKVInput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/UnorderedKVInput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/UnorderedKVOutput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java
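
As a rough sketch of how those classes come into play (the pairings follow
from the edge config chosen when building the DAG; the key/value classes
here are placeholders):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.conf.UnorderedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class EdgePairingSketch {
  public static void main(String[] args) {
    // Scatter-gather with sorting: the producer side ends up with
    // OrderedPartitionedKVOutput, the consumer side with OrderedGroupedKVInput.
    OrderedPartitionedKVEdgeConfig ordered = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), LongWritable.class.getName(),
            HashPartitioner.class.getName())
        .build();

    // Broadcast without sorting: UnorderedKVOutput on the producer side,
    // UnorderedKVInput on the consumer side.
    UnorderedKVEdgeConfig broadcast = UnorderedKVEdgeConfig
        .newBuilder(Text.class.getName(), LongWritable.class.getName())
        .build();

    // The resulting EdgeProperty objects are what Edge.create(...) takes,
    // as in the earlier DAG sketch.
    System.out.println(ordered.createDefaultEdgeProperty().getDataMovementType());
    System.out.println(broadcast.createDefaultBroadcastEdgeProperty().getDataMovementType());
  }
}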

For instance, an ordered output would write the data in sorted format. There
are different types of sorters available in Tez which can be chosen at
runtime (DefaultSorter, PipelinedSorter); a small configuration sketch
follows the IFile link below. Intermediate data of tasks is written in the
"IFile" format, which is similar to the IFile format in the MR world but has
more optimizations in the Tez implementation.

https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
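
For the runtime sorter choice, something along these lines should work when
building the edge configuration (the config key and values come from
TezRuntimeConfiguration; please double-check them against your Tez version,
as I am writing this from memory):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.runtime.library.api.TezRuntimeConfiguration;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class SorterConfigSketch {
  public static OrderedPartitionedKVEdgeConfig buildEdgeConfig() {
    Configuration conf = new Configuration(false);
    // "PIPELINED" selects PipelinedSorter, "LEGACY" selects DefaultSorter.
    conf.set(TezRuntimeConfiguration.TEZ_RUNTIME_SORTER_CLASS, "PIPELINED");

    return OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), LongWritable.class.getName(),
            HashPartitioner.class.getName())
        .setFromConfiguration(conf)
        .build();
  }
}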

As far as reading is concerned, the key/value class and serializer
information is passed on as part of creating the DAG; a consumer-side sketch
follows these links. E.g.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java#L360
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/readers/UnorderedKVReader.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/ValuesIterator.java
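
So on the consumer side you can cast to whatever key/value classes were
registered on the edge. A rough sketch, assuming the Text/LongWritable
classes from the earlier examples and an upstream vertex named "mapper"
(both placeholders):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.runtime.api.ProcessorContext;
import org.apache.tez.runtime.library.api.KeyValuesReader;
import org.apache.tez.runtime.library.processor.SimpleProcessor;

public class SumProcessor extends SimpleProcessor {

  public SumProcessor(ProcessorContext context) {
    super(context);
  }

  @Override
  public void run() throws Exception {
    // OrderedGroupedKVInput hands back a KeyValuesReader: one key at a time
    // together with the values grouped under it (backed by ValuesIterator).
    KeyValuesReader reader =
        (KeyValuesReader) getInputs().get("mapper").getReader();

    while (reader.next()) {
      // Safe to cast, because these are the classes declared on the edge config.
      Text key = (Text) reader.getCurrentKey();
      long sum = 0;
      for (Object value : reader.getCurrentValues()) {
        sum += ((LongWritable) value).get();
      }
      // ... use (key, sum), e.g. write it to the vertex's output
    }
  }
}

That is essentially the answer to your second question: the readers expose
Object, but the concrete types are the ones registered on the edge when the
DAG was built, so a cast is enough to get at the actual values.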

~Rajesh.B

On Sat, Nov 26, 2016 at 5:13 AM, Robert Grandl <rgra...@yahoo.com.invalid>
wrote:

> Hi guys,
> I am not sure where the right place to post this question is, hence I am
> sending it to both the Hive and Tez dev mailing lists.
>
> I am trying to get a better understanding of how the input / output for a
> task is handled. Typically input stages read the data to be processed.
> Next, all the data will flow in the form of key / value pairs till the end
> of the job's execution.
>
> 1. Could you guys point me to the key files where I should look to
> identify that? I am mostly interested in intercepting where data is read by
> a task and where the data is written after the task processes the input data.
>
> 2. Also, is there a way I can identify the types (and hence read the
> actual values) of a key / value pair instead of just Object key, Object
> value?
> Thanks in advance,
> Robert
>
>


