[ https://issues.apache.org/jira/browse/PIG-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594957#action_12594957 ]
Shravan Matthur Narayanamurthy commented on PIG-162:
----------------------------------------------------
Thanks for reviewing the patch, Arun. My responses are inline.
1. Hadoop doesn't require Writables from 0.17.0 onwards: HADOOP-1986, so you
could use that as an advantage.
[shrav] This is great. However, in the types branch we are still on Hadoop
0.15. We plan to merge the changes from the main branch later, and I think this
is a good candidate to take up then. I thought about it for a while. Currently,
we do not have an umbrella class for our types other than WritableComparable,
and I could not come up with a neat solution for this. It is certainly a good
point, and we need to spend more time on it.
2. I agree with Alan about map-only jobs, just use something similar to PIG-196
(my unbiased opinion smile).
[shrav] I think doing something like PIG-196 would incur a branch in every
call to the map function to check whether it is a map-only job. This additional
complexity is due to the introduction of types. In map-only jobs, we don't
care about extracting the key & indexed tuple. In a map-reduce job, we have to
do the extraction. This is the branching I wanted to avoid. I gave a naive
solution by duplicating code: one copy for map-only & the other for
map-reduce. A better solution, as Alan suggested, would be to have the
map-only & map-reduce Map classes extend a common base class with an abstract
collectKeyAndTuple function, which each of the two classes implements
accordingly.
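
A minimal sketch of the subclassing idea above, assuming hypothetical class
names (SketchMapBase, SketchMapOnly, SketchMapReduce) and plain strings as
stand-ins for Pig tuples and the Hadoop output collector. It is not the code in
the patch; it only shows how an abstract collectKeyAndTuple removes the
per-call check for the job type.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base class: shared map logic, with no per-call job-type check. */
abstract class SketchMapBase {
    /** Stand-in for the Hadoop output collector; emitted records land here. */
    protected final List<String> emitted = new ArrayList<>();

    /** Called once per input; result holds {key, value} produced by the map plan. */
    public final void map(String[] result) {
        collectKeyAndTuple(result);
    }

    /** Each job type decides how a result is emitted. */
    protected abstract void collectKeyAndTuple(String[] result);

    public List<String> getEmitted() {
        return emitted;
    }
}

/** Map-only job: emit the tuple as-is, with no key extraction. */
final class SketchMapOnly extends SketchMapBase {
    @Override
    protected void collectKeyAndTuple(String[] result) {
        emitted.add("(" + String.join(",", result) + ")");
    }
}

/** Map-reduce job: extract the key and pair it with the remaining value. */
final class SketchMapReduce extends SketchMapBase {
    @Override
    protected void collectKeyAndTuple(String[] result) {
        String key = result[0];
        String value = result[1];
        emitted.add(key + " -> (" + value + ")");
    }
}
```

The job type is chosen once, when the Map class is picked at job setup, so the
map function itself never has to test for it.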
3. RunnableReporter is a thread which blindly does 'reporting'. This makes it
very hard to debug when applications go haywire. By this, you are going to miss
a very important safety net provided by Hadoop Map-Reduce i.e. the ability to kill
tasks which aren't 'progressing'. Please do not do this! Ideally you should be
using the reporter in the map/reduce functions to report progress when tuples
are being consumed.
[shrav] You are right. I will change that. Since this is a major change, I will
do it once this patch and Shubham's patch are in. I will write a proposal on the
changes and submit it.
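
A minimal sketch of the approach suggested in point 3, reporting progress only
when tuples are actually consumed. SketchReporter, CountingReporter and
ReportingConsumer are hypothetical names used for illustration; SketchReporter
is a stand-in and not Hadoop's own reporter interface, and the reporting
interval is an assumed parameter.

```java
/** Hypothetical stand-in for the reporter handed to map/reduce functions. */
interface SketchReporter {
    void progress();
}

/** Simple reporter that counts how often progress was reported. */
final class CountingReporter implements SketchReporter {
    int calls = 0;

    @Override
    public void progress() {
        calls++;
    }
}

/** Reports progress only as a side effect of consuming tuples. */
final class ReportingConsumer {
    private final SketchReporter reporter;
    private final int interval;
    private long consumed = 0;

    ReportingConsumer(SketchReporter reporter, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.reporter = reporter;
        this.interval = interval;
    }

    void consume(String tuple) {
        // Real per-tuple work would happen here.
        consumed++;
        // Report once every `interval` tuples; a stalled task reports nothing.
        if (consumed % interval == 0) {
            reporter.progress();
        }
    }
}
```

Because reporting is tied to consumption, a task that stops consuming tuples
also stops reporting, so the framework can still detect it as not progressing.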
4. The 'Slicer' notion is missing from PigInputFormat/PigSplit... are you
planning to integrate it later?
[shrav] Yes, we have left it for the merging phase later.
5. It's great that you are using Hadoop's jobcontrol, please let us know if
anything was amiss here: HADOOP-2484.
[shrav] It works well. Probably some more documentation would be helpful.
Regarding Pig's notion of "Properties", are you referring to the backend and
datastorage? If so, I think we need to take this up during the merge of changes
from the main branch.
Regarding creating a separate JobControlCompiler: I did that because I wanted
to leave some room for the optimizer to act. Once the MROperPlan is built,
it can be optimized, and then JobControlCompiler can work on the optimized plan
to generate the Job Control.
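
The staging described above can be sketched as follows. All types here
(SketchPlan, SketchPlanOptimizer, SketchJobControlCompiler, SketchPipeline) are
hypothetical stand-ins, not Pig's MROperPlan or JobControlCompiler, and the
"one job per operator" rule is an assumption made only to keep the example
small. The point is that the compiler sees only the plan the optimizer returns.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a built map-reduce operator plan. */
final class SketchPlan {
    final List<String> operators;

    SketchPlan(List<String> operators) {
        this.operators = new ArrayList<>(operators);
    }
}

/** Hypothetical optimizer hook that runs between plan building and compilation. */
interface SketchPlanOptimizer {
    SketchPlan optimize(SketchPlan plan);
}

/** Hypothetical compiler: turns a plan into job names, one per operator. */
final class SketchJobControlCompiler {
    List<String> compile(SketchPlan plan) {
        List<String> jobs = new ArrayList<>();
        for (String op : plan.operators) {
            jobs.add("job:" + op);
        }
        return jobs;
    }
}

final class SketchPipeline {
    /** Build, then optimize, then compile; the compiler never sees the raw plan. */
    static List<String> run(SketchPlan built, SketchPlanOptimizer optimizer,
                            SketchJobControlCompiler compiler) {
        SketchPlan optimized = optimizer.optimize(built);
        return compiler.compile(optimized);
    }
}
```

Keeping compilation as a separate step means an optimizer can be added or
swapped without touching the code that generates the jobs.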
> Rework mapreduce submission and monitoring
> ------------------------------------------
>
> Key: PIG-162
> URL: https://issues.apache.org/jira/browse/PIG-162
> Project: Pig
> Issue Type: Sub-task
> Environment: This bug tracks work to rework the submission and
> monitoring interface to map reduce as described in
> http://wiki.apache.org/pig/PigTypesFunctionalSpec
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: mapreduceJumbo.patch, split.png,
> TEST-org.apache.pig.test.TestMRCompiler.txt,
> TEST-org.apache.pig.test.TestUnion.txt
>
>