[jira] [Commented] (TEZ-1168) Add a MultiMRInput

Siddharth Seth (JIRA) Thu, 12 Jun 2014 16:29:50 -0700

    [ https://issues.apache.org/jira/browse/TEZ-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030002#comment-14030002 ]


Siddharth Seth commented on TEZ-1168:
-------------------------------------

The counter names etc. should not have changed. If you're referring to the 
Context objects: they've been changed to no longer require an 
Input/OutputContext, and instead work with just the fields they actually 
need. That makes separating out the MRReaders much simpler, which the patch 
also does, so that they can be re-used between MRInput and MultiMRInput.

An SMB join ends up reading multiple bucketed tables, each of which may have 
multiple partitions. Each bucket within an individual partition of a table is 
sorted. For a task, the stream table will have as its source a single bucket 
from one or more partitions; this is joined with the same bucket from all 
partitions of the rest of the tables. That's where the individually sorted 
partitions belonging to the same bucket need to be accessed separately and 
merged together before joining, and that's where MultiMRInput comes in (the 
stream side also uses it, though).
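
To make the merge step concrete, here is a minimal sketch in plain Java. It 
does not use any Tez or Hive classes: each sorted partition of a bucket is 
stood in for by an Iterator, and the names SortedMergeSketch, Head and 
mergeSorted are hypothetical and not taken from the patch. It only 
illustrates the idea of keeping one reader per sorted source and merging 
them, roughly what a consumer of per-split readers would need to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of Tez, Hive, or the TEZ-1168 patch.
final class SortedMergeSketch {

    /** Pairs the current head element of one sorted source with the rest of that source. */
    private static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    /**
     * Merges several individually sorted sources into one sorted list.
     * Each source stands in for one sorted partition of the same bucket.
     */
    static <T> List<T> mergeSorted(List<Iterator<T>> sources, Comparator<? super T> order) {
        // The heap always exposes the smallest head among all sources.
        PriorityQueue<Head<T>> heap =
                new PriorityQueue<>((a, b) -> order.compare(a.value, b.value));
        for (Iterator<T> source : sources) {
            if (source.hasNext()) {
                heap.add(new Head<>(source.next(), source));
            }
        }

        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head<T> smallest = heap.poll();
            merged.add(smallest.value);
            // Refill the heap from the source that just yielded an element.
            if (smallest.rest.hasNext()) {
                heap.add(new Head<>(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three sorted "partitions" of the same bucket (made-up keys).
        List<Iterator<Integer>> partitions = new ArrayList<>();
        partitions.add(Arrays.asList(1, 4, 9).iterator());
        partitions.add(Arrays.asList(2, 3, 10).iterator());
        partitions.add(Arrays.asList(5, 6).iterator());
        System.out.println(mergeSorted(partitions, Comparator.<Integer>naturalOrder()));
    }
}
```

A real consumer would stream the merged records into the join rather than 
collect them in a list; the list is used here only to keep the sketch short.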

> Add a MultiMRInput
> ------------------
>
>                 Key: TEZ-1168
>                 URL: https://issues.apache.org/jira/browse/TEZ-1168
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: TEZ-1168.1.txt
>
>
> A logical Input which can process multiple splits, and gives back a reader 
> for each of them.
> Used by SMB joins in Hive - which have multiple sorted files, which need to 
> be merged together.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
