Re: Data manipulation in Hive over Tez

Robert Grandl Thu, 01 Dec 2016 17:42:44 -0800

 Thanks Rajesh for your answer. That was really helpful. 

I would like to ask you a few more questions. I am trying to better
understand how the <Key, Value> pairs are propagated and processed at the
various vertices.

- Edge: encodes the data movement logic.
- Processing logic: processes the input and partitions the output key space
  according to its logic. The processing logic in every stage also follows a
  sequence of operators through which every <Key, Value> pair is passed.
My questions are:

1) I am a bit confused about how far the processing logic in a stage goes
(especially in Reduce tasks). Given an input of <Key, Value> pairs, what are
the typical patterns of processing logic, i.e. what kind of <Key, Value>
pairs can it produce, and how much can the vertex change them? The question
is a bit confusing, but basically I am trying to understand what kinds of
{input <Key, Value>, output <Key, Value>} patterns can be handled in general
by typical processing logic for SQL queries written in Hive atop Tez.

2) I can't really wrap my head around how much connection exists between the
data movement encoded in edges and how the <Key, Value> pairs are generated
by a vertex and moved to the corresponding downstream vertices.

Thanks again for your answers,
Robert

 On Tuesday, November 29, 2016 4:04 AM, Rajesh Balamohan 
<[email protected]> wrote:

 Hi Robert,

1. At a high level, you can refer to
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
where the different vertices, edges, etc. get created as per the execution
plan.
Consider a vertex as a combination of input, processing logic and output.
Different vertices are connected together by edges, which define the data
movement logic (broadcast, scatter-gather, one-to-one, etc.).
The edge configuration defines the key/value class types. This DAG is
submitted to Tez for execution.

2. For task processing, you can refer to
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezProcessor.java
on the Hive side.

3. On the Tez side, there are different types of inputs and outputs
available for reading/writing data, e.g. OrderedGroupedKVInput,
UnorderedKVInput, OrderedPartitionedKVOutput, UnorderedKVOutput,
UnorderedPartitionedKVOutput, etc.
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/OrderedGroupedKVInput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/UnorderedKVInput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/UnorderedKVOutput.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java
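
To make the data movement types mentioned in point 1 (broadcast,
scatter-gather, one-to-one) a bit more concrete, here is a minimal sketch of
which downstream task would receive a key/value pair under each of them. It
is a toy model in plain Java, not Tez code: the class EdgeRoutingSketch, the
Movement enum and the hash-modulo partitioning rule are assumptions made for
illustration, and the partitioner Hive actually configures may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of edge data movement; illustrative only, not the Tez API. */
final class EdgeRoutingSketch {

    /** Hypothetical names for the three movement types discussed above. */
    enum Movement { BROADCAST, SCATTER_GATHER, ONE_TO_ONE }

    /** Assumed partitioning rule: non-negative hash modulo partition count. */
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Downstream task indexes that receive a pair emitted by task srcTask. */
    static List<Integer> destinations(Movement movement, int srcTask,
                                      Object key, int numDownstream) {
        List<Integer> out = new ArrayList<>();
        switch (movement) {
            case BROADCAST:
                // Every downstream task sees every pair.
                for (int i = 0; i < numDownstream; i++) {
                    out.add(i);
                }
                break;
            case SCATTER_GATHER:
                // The key alone decides the single receiving task.
                out.add(partitionFor(key, numDownstream));
                break;
            case ONE_TO_ONE:
                // Task i feeds task i; the key plays no role.
                out.add(srcTask);
                break;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Movement movement : Movement.values()) {
            System.out.println(movement + " -> "
                    + destinations(movement, 1, "user42", 3));
        }
    }
}
```

The point of the sketch is that only in the scatter-gather case does the key
produced by the upstream vertex influence where the pair ends up.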

For instance, an ordered output writes the data in sorted order. There are
different types of sorters available in Tez which can be chosen at runtime
(DefaultSorter, PipelinedSorter). Intermediate data of tasks is written in
the "IFile" format, which is similar to the IFile format in the MR world but
has more optimizations in the Tez implementation.

https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
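
As a rough illustration of the ordered case, the sketch below sorts pairs by
key on the producer side and then groups values per key on the consumer side,
so the downstream logic sees each key once together with its values. It is
plain Java and not Tez code: SortGroupSketch and its methods are hypothetical
names, the pairs are kept in memory instead of an IFile, and a TreeMap stands
in for the real merge of sorted runs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of "sorted output, grouped input"; not the Tez implementation. */
final class SortGroupSketch {

    /** Convenience factory for one key/value pair. */
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    /** Producer side: return a copy of the pairs sorted by key. */
    static List<Map.Entry<String, Integer>> sortByKey(
            List<Map.Entry<String, Integer>> pairs) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(pairs);
        sorted.sort(Map.Entry.<String, Integer>comparingByKey());
        return sorted;
    }

    /**
     * Consumer side: combine the runs from several producers and collect
     * the values per key. The TreeMap keeps the keys in sorted order; a real
     * implementation would stream a merge over the sorted runs instead.
     */
    static Map<String, List<Integer>> mergeAndGroup(
            List<List<Map.Entry<String, Integer>>> runs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> run : runs) {
            for (Map.Entry<String, Integer> pair : run) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> fromTask0 = new ArrayList<>();
        fromTask0.add(kv("b", 2));
        fromTask0.add(kv("a", 1));
        List<Map.Entry<String, Integer>> fromTask1 = new ArrayList<>();
        fromTask1.add(kv("a", 3));

        List<List<Map.Entry<String, Integer>>> runs = new ArrayList<>();
        runs.add(sortByKey(fromTask0));
        runs.add(sortByKey(fromTask1));
        System.out.println(mergeAndGroup(runs));
    }
}
```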

As far as reading is concerned, key/value class and serializer information
is passed on as part of creating the DAG. E.g.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java#L360
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/readers/UnorderedKVReader.java
https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/ValuesIterator.java
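
The idea that the reader learns the key/value types from configuration can
be sketched as follows. This is a simplified stand-in in plain Java, not the
Hive or Tez serialization code: TypedKVSketch, the two "sketch.*" property
names and the String/Long-only serializer are all assumptions made for
illustration. The class names recorded in the configuration tell the reader
how to decode the bytes, and the decoded objects carry their concrete runtime
types even though they are handed over as Object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Toy model of typed key/value reading; not Hive or Tez code. */
final class TypedKVSketch {

    // Hypothetical property names, not real Tez configuration keys.
    static final String KEY_CLASS = "sketch.key.class";
    static final String VALUE_CLASS = "sketch.value.class";

    /** Minimal serializer: only String and Long are supported here. */
    private static void write(DataOutputStream out, Object o) throws IOException {
        if (o instanceof String) {
            out.writeUTF((String) o);
        } else if (o instanceof Long) {
            out.writeLong((Long) o);
        } else {
            throw new IllegalArgumentException("unsupported type: " + o.getClass());
        }
    }

    /** Decode one object; the class name from the config selects the format. */
    private static Object read(DataInputStream in, String className) throws IOException {
        if (String.class.getName().equals(className)) {
            return in.readUTF();
        }
        if (Long.class.getName().equals(className)) {
            return in.readLong();
        }
        throw new IllegalArgumentException("unsupported type: " + className);
    }

    /** Producer side: serialize the key followed by the value. */
    static byte[] encode(Object key, Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            write(out, key);
            write(out, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Consumer side: returns {key, value}, typed according to the config. */
    static Object[] decode(byte[] data, Map<String, String> conf) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            Object key = read(in, conf.get(KEY_CLASS));
            Object value = read(in, conf.get(VALUE_CLASS));
            return new Object[] {key, value};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(KEY_CLASS, String.class.getName());
        conf.put(VALUE_CLASS, Long.class.getName());

        Object[] kv = decode(encode("page_views", 42L), conf);
        // getClass() reveals the concrete types behind the Object references.
        System.out.println(kv[0].getClass().getName() + " / "
                + kv[1].getClass().getName());
    }
}
```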

~Rajesh.B

On Sat, Nov 26, 2016 at 5:13 AM, Robert Grandl <[email protected]>
wrote:

> Hi guys,
> I am not sure where the right place to post this question is, hence I am
> sending it to both the hive and tez dev mailing lists.
>
> I am trying to get a better understanding of how the input / output for a
> task is handled. Typically, input stages read the data to be processed.
> Next, all the data flows in the form of key / value pairs until the end of
> the job's execution.
>
> 1. Could you guys point me to the key files where I should look to
> identify that? I am mostly interested in intercepting where data is read by
> a task and where the data is written after the task processes the input
> data.
>
> 2. Also, is there a way I can identify the types (and hence read the
> actual values) of a key / value pair instead of just Object key, Object
> value?
> Thanks in advance,
> Robert
>
>

-- 
~Rajesh.B
