[ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626492#action_12626492 ]

Enis Soztutar commented on HADOOP-3702:
---------------------------------------

bq. Making ChainMapper<K1, V1, K2, V2> does not make sense in this case as a 
developer is never instantiating it. Thus there is no type checking being done 
at compilation here for K1, V1, K2, V2.
It does not matter whether the classes are instantiated by the user or not; it 
is better to have them use generics properly as much as we can, so that we can 
get rid of the @SuppressWarnings("unchecked"). IdentityMapper is also never 
instantiated by the user. 
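
The point about @SuppressWarnings can be illustrated with a minimal sketch (hypothetical names, not the actual Hadoop classes): even when the framework, rather than the user, instantiates a class reflectively, declaring it with proper generic parameters lets the surrounding code compile without suppressed warnings.

{code}
import java.util.ArrayList;
import java.util.List;

public class GenericsSketch {

    // Raw-typed version: casts everywhere, warnings must be suppressed.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static class RawCollector {
        List out = new ArrayList();
        void collect(Object key, Object value) {
            out.add(key);
            out.add(value);
        }
    }

    // Generified version: no casts, no suppressed warnings, and the
    // compiler rejects a mismatched key/value at the call site.
    static class TypedCollector<K, V> {
        final List<Object> out = new ArrayList<>();
        void collect(K key, V value) {
            out.add(key);
            out.add(value);
        }
    }

    public static void main(String[] args) {
        TypedCollector<String, Integer> c = new TypedCollector<>();
        c.collect("count", 42);
        // c.collect(42, "count");  // would not compile: types enforced
        System.out.println(c.out);  // prints [count, 42]
    }
}
{code}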

bq. Even if you did create an instance of ChainMapper bound to concrete 
classes for K1, V1, K2, V2, it would work for the common case of mappers in the 
chain using different key/value classes. See, the contract between the maps in a 
chain is that the input key/value classes of the first mapper are the same as 
the input key/value classes of the job, the input key/value classes of the 
second mapper are the same as the output key/value classes of the first 
mapper, and so on, and the output key/value classes of the last mapper in the 
chain (for the ChainMapper) are the same as the input key/value classes of the 
reducer.
Yes, I know that the whole dataflow would be 
{code}
(K1, V1) -> (Ka, Va) -> ... -> (K2, V2) -> (K3, V3) 
{code}
where the first mapper takes (K1, V1), the last mapper outputs (K2, V2), and 
the patch checks for that. 
For example, declaring Chain#mappers as Mapper<?,?,?,?> does not force 
consecutive mappers to use matching input/output classes, whereas 
ChainOutputCollector uses <K,V,KO,VO> so that its input and output classes are 
enforced (though unfortunately we cannot instantiate ChainOutputCollectors 
with bound parameters). 
Chain#addMapper() is _static_ so that its generic <K1,V1,K2,V2> parameters 
are not bound by the Chain<K1,...V3> parameters. So you can again use 
{code}
JobConf conf = ...;
ChainMapper.addMapper(conf, AMap.class, K1.class, V1.class, Ka.class, Va.class, null);
ChainMapper.addMapper(conf, BMap.class, Ka.class, Va.class, Kb.class, Vb.class, null);
ChainMapper.addMapper(conf, CMap.class, Kb.class, Vb.class, K2.class, V2.class, null);
{code}
My version of the patch uses a similar approach to JobConf, in that we try to 
use generics as much as possible, which, I think, should be the preferred 
way. 
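
The static-method point above can be sketched in isolation (simplified, hypothetical ChainSketch class, not the real Chain/ChainMapper): a static method declares its own <K1, V1, K2, V2> type parameters, so they are inferred per call and are not tied to any parameters on the enclosing class. That is what lets each addMapper call in a chain bind different key/value classes while the compiler still checks each call's four classes against the Mapper class being added.

{code}
import java.util.ArrayList;
import java.util.List;

public class ChainSketch {

    // Stand-in for a mapper declaration: input (K1, V1), output (K2, V2).
    interface Mapper<K1, V1, K2, V2> { }

    static class AMap implements Mapper<Long, String, String, Integer> { }
    static class BMap implements Mapper<String, Integer, String, Long> { }

    static final List<String> chain = new ArrayList<>();

    // Static generic method: <K1, V1, K2, V2> are bound per invocation,
    // independently of any class-level type parameters.
    static <K1, V1, K2, V2> void addMapper(
            Class<? extends Mapper<K1, V1, K2, V2>> mapper,
            Class<K1> inKey, Class<V1> inVal,
            Class<K2> outKey, Class<V2> outVal) {
        chain.add(mapper.getSimpleName());
    }

    public static void main(String[] args) {
        // Each call picks its own bindings; a call whose key/value
        // classes disagree with the mapper's declaration will not compile.
        addMapper(AMap.class, Long.class, String.class, String.class, Integer.class);
        addMapper(BMap.class, String.class, Integer.class, String.class, Long.class);
        System.out.println(chain);  // prints [AMap, BMap]
    }
}
{code}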



> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, 
> patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, 
> patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other 
> with no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a 
> significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce 
> phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
