[ https://issues.apache.org/jira/browse/TEZ-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030002#comment-14030002 ]
Siddharth Seth commented on TEZ-1168:
-------------------------------------

The counter names etc. should not have changed. If you're referring to the Context objects, they were changed to no longer require an Input/OutputContext and instead work with only the fields they actually require. This makes separating out the MRReaders much simpler, which the patch also does so that they can be reused between MRInput and MultiMRInput.

An SMB join ends up reading multiple bucketed tables, each of which may have multiple partitions. Each bucket within an individual partition of a table is sorted. For a task, the stream table will have as its source a single bucket from one or more partitions; this is joined with all partitions, belonging to the same bucket, from the rest of the tables. That's where individual sorted partitions belonging to the same bucket need to be accessed separately and merged together before joining, and where MultiMRInput comes in (the stream side also uses it, though).

> Add a MultiMRInput
> ------------------
>
>                 Key: TEZ-1168
>                 URL: https://issues.apache.org/jira/browse/TEZ-1168
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: TEZ-1168.1.txt
>
>
> A logical Input which can process multiple splits, and gives back a reader
> for each of them.
> Used by SMB joins in Hive - which have multiple sorted files, which need to
> be merged together.
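
For illustration, the merge step described in the comment (several individually sorted partitions of the same bucket combined into one sorted stream before the join) can be sketched as a generic k-way merge. This is a minimal sketch in plain Java, not Tez or Hive code: the class name `SortedPartitionMerge`, its `merge` method, and the use of `java.util.Iterator` as a stand-in for the per-split readers are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: none of these names come from Tez or Hive.
class SortedPartitionMerge {

    // Pairs the current head record of one sorted source with the rest of that source.
    private static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    // Merges already-sorted sources into one sorted list (k-way merge).
    // Each Iterator stands in for a reader over one sorted partition of a bucket.
    static <T> List<T> merge(List<Iterator<T>> sortedSources, Comparator<? super T> order) {
        PriorityQueue<Head<T>> heap =
            new PriorityQueue<>((a, b) -> order.compare(a.value, b.value));

        // Seed the heap with the first record of every non-empty source.
        for (Iterator<T> source : sortedSources) {
            if (source.hasNext()) {
                heap.add(new Head<>(source.next(), source));
            }
        }

        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Emit the smallest head, then refill from the source it came from.
            Head<T> smallest = heap.poll();
            merged.add(smallest.value);
            if (smallest.rest.hasNext()) {
                heap.add(new Head<>(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three sorted "partitions" of the same bucket.
        List<Iterator<Integer>> readers = Arrays.asList(
            Arrays.asList(1, 4, 9).iterator(),
            Arrays.asList(2, 3, 10).iterator(),
            Arrays.asList(5).iterator());
        // prints [1, 2, 3, 4, 5, 9, 10]
        System.out.println(merge(readers, Comparator.<Integer>naturalOrder()));
    }
}
```

The sketch collects the merged records into a list only to keep the example short; a real join would more likely consume the merged stream record by record.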