[
https://issues.apache.org/jira/browse/PIG-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649457#action_12649457
]
Mridul Muralidharan commented on PIG-539:
-----------------------------------------
Hi Chris,
We had a similar requirement: the number of map tasks was high (because an
earlier part of the pipeline created a large number of small files), and we
wanted a small, fixed number of map tasks (equal to the number of mappers in
the cluster).
The only way I found to control the behavior in this case was something
extremely heavyweight:
LOAD, GROUP (with PARALLEL), FOREACH/FLATTEN, and then the rest of the pipeline.
Apparently, there is no other way to do this in Pig currently ...
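
For reference, a minimal sketch of that workaround. The paths, field names,
and the parallelism of 10 are placeholders I made up for illustration, not
values from our actual script. Note that the initial map phase still runs one
task per input split; what gets a fixed task count is the reduce phase that
the GROUP introduces, and the rest of the pipeline then runs there.

```pig
-- Hypothetical input path and schema; substitute the real ones.
raw = LOAD 'input/small_files' AS (key, value);

-- GROUP forces a reduce phase; PARALLEL fixes the number of reduce tasks.
-- 10 is a placeholder for the desired task count.
grouped = GROUP raw BY key PARALLEL 10;

-- FLATTEN the bag to get the original records back, now inside the reducers.
records = FOREACH grouped GENERATE FLATTEN(raw);

-- The rest of the pipeline continues from 'records'.
STORE records INTO 'output/regrouped';
```

The cost is a full shuffle of the data just to change the task count, and the
load per task depends on how evenly the grouping key is distributed.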
> unable to control parallelism of Map tasks
> ------------------------------------------
>
> Key: PIG-539
> URL: https://issues.apache.org/jira/browse/PIG-539
> Project: Pig
> Issue Type: Bug
> Components: impl
> Environment: local execution + hadoop execution
> Reporter: Christopher Olston
>
> I put "PARALLEL 1" following *every* statement in my pig script, and it still
> executes maps with more than 1 parallel task. This is a major problem because
> for one of my operations I need to have a serialized (non-parallel) map.
> Probably the semantics of parallelism should be as follows:
> 1. group pig operators into map/reduce stages
> 2. for each stage, take the minimum of the "Parallel" directives given by
> the user for statements executed as part of that stage
> (We'll have to decide on a rule for statements that use the combiner, which
> execute partially on the map side and partially on the reduce side ...)