Travis Woodruff created PIG-5326:
------------------------------------

             Summary: Issue with auto parallelism and scalar inputs in Tez
                 Key: PIG-5326
                 URL: https://issues.apache.org/jira/browse/PIG-5326
             Project: Pig
          Issue Type: Bug
          Components: tez
            Reporter: Travis Woodruff


I'm getting a "Scalar has more than one row in the output" error with the 
following script:

{code}
a = LOAD 't' as (x:chararray);
b = GROUP a BY x PARALLEL 2;
c = GROUP b by group;
d = FOREACH (GROUP a ALL) GENERATE COUNT(a) as count;
e = FOREACH c GENERATE group, d.count;
DUMP e;
{code}

If I add a PARALLEL clause to {{c}}, the error goes away, so the issue seems to 
be related to auto parallelism.

I'm not very familiar with Tez, so take this with a grain of salt, but the 
issue seems to be related to the following sequence:

# {{PigGraceShuffleVertexManager}} calls {{VertexImpl.reconfigureVertex()}}, 
which configures the parallelism of the vertex ({{VertexImpl.numTasks}})
# The {{InputSpec}} for the scalar input is created (via 
{{Edge.getDestinationSpec()}}) with {{physicalInputCount}} equal to the 
parallelism set above
# The input is created (in {{LogicalIOProcessorRuntimeTask.createInput()}}) 
based on this {{InputSpec}}.
# The resulting {{UnorderedKVInput}} creates a {{ShuffleManager}} with 
{{numInputs}} = {{numPhysicalInputs}}.
 
This creates a reader that reads the scalar input {{numPhysicalInputs}} times, 
which results in the "Scalar has more than one row in the output" error in 
{{ReadScalarsTez}}.

When parallelism is specified explicitly, {{VertexImpl.reconfigureVertex()}} is 
never called, and {{numPhysicalInputs}} stays at 1 for the scalar input.
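
The suspected mechanism can be illustrated with a toy model. This is only a sketch under the assumptions stated above, not Pig or Tez code: the class {{ScalarInputSketch}} and its methods {{rowsSeenByReader}} and {{readScalar}} are hypothetical names made up for illustration. It assumes the reader hands back the single scalar row once per physical input, and that the scalar check rejects anything beyond one row.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these names are real Pig or Tez APIs.
final class ScalarInputSketch {

    // Models a reader that returns the single scalar row once per physical input.
    static List<Long> rowsSeenByReader(long scalarValue, int numPhysicalInputs) {
        List<Long> rows = new ArrayList<>();
        for (int i = 0; i < numPhysicalInputs; i++) {
            rows.add(scalarValue);
        }
        return rows;
    }

    // Models the reported check: a scalar may consist of at most one row.
    static Long readScalar(List<Long> rows) {
        if (rows.size() > 1) {
            throw new IllegalStateException("Scalar has more than one row in the output");
        }
        return rows.isEmpty() ? null : rows.get(0);
    }

    public static void main(String[] args) {
        // Explicit parallelism case: one physical input, so the scalar reads cleanly.
        System.out.println(readScalar(rowsSeenByReader(3L, 1)));

        // Auto parallelism case: the input count follows the reconfigured
        // vertex parallelism (2 here), so the same row is seen twice.
        try {
            readScalar(rowsSeenByReader(3L, 2));
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

With one physical input the value is returned; with two, the same row is seen twice and the check fails, which matches the behavior observed with and without an explicit PARALLEL clause on {{c}}.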



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
