[
https://issues.apache.org/jira/browse/PIG-4871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liyunzhang_intel updated PIG-4871:
----------------------------------
Description:
In current code base, we use OperatorPlan#forceConnect() while merge the
physical plan of spliter and splittee in MultiQueryOptimizationSpark.
The difference between OperatorPlan#connect and OperatorPlan#forceConnect is
not checking whether support multiOutputs and multiInputs or not in
forceConnect.
{code}
/**
* connect from and to and ignore some judgements: like ignoring judge
whether from operator supports multiOutputs
* and whether to operator supports multiInputs
*
* @param from Operator data will flow from.
* @param to Operator data will flow to.
* @throws PlanException if connect from or to which is not in the plan
*/
public void forceConnect(E from, E to) throws PlanException {
markDirty();
// Check that both nodes are in the plan.
checkInPlan(from);
checkInPlan(to);
mFromEdges.put(from, to);
mToEdges.put(to, from);
}
{code}
Let's use an example to explain why add forceConnect before.
{code}
before multiquery optimization:
scope-102->scope-108
scope-108
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-102
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp1293488480/tmp128312889:org.apache.pig.impl.io.InterStorage)
- scope-103
|
|---A: New For Each(false,false,false)[bag] - scope-75
| |
| Cast[int] - scope-67
| |
| |---Project[bytearray][0] - scope-66
| |
| Cast[int] - scope-70
| |
| |---Project[bytearray][1] - scope-69
| |
| Cast[int] - scope-73
| |
| |---Project[bytearray][2] - scope-72
|
|---A:
Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
- scope-65--------
Spark node scope-108
D:
Store(hdfs://zly1.sh.intel.com:8020/user/root/split5.out:org.apache.pig.builtin.PigStorage)
- scope-101
|
|---D: Filter[bag] - scope-97
| |
| Greater Than or Equal[boolean] - scope-100
| |
| |---Project[int][1] - scope-98
| |
| |---Project[int][4] - scope-99
|
|---C: New For Each(true,true)[tuple] - scope-96
| |
| Project[bag][1] - scope-94
| |
| Project[bag][2] - scope-95
|
|---C: Package(Packager)[tuple]{int} - scope-89
|
|---C: Global Rearrange[tuple] - scope-88
|
|---C: Local Rearrange[tuple]{int}(false) - scope-90
| | |
| | Project[int][0] - scope-91
| |
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp1293488480/tmp128312889:org.apache.pig.impl.io.InterStorage)
- scope-104
|
|---C: Local Rearrange[tuple]{int}(false) - scope-92
| |
| Project[int][0] - scope-93
|
|---B: New For Each(false,false)[bag] - scope-85
| |
| Project[int][0] - scope-81
| |
| Project[int][1] - scope-83
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp1293488480/tmp128312889:org.apache.pig.impl.io.InterStorage)
- scope-106--------
after multiquery optimization:
scope-108
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-108
D:
Store(hdfs://zly1.sh.intel.com:8020/user/root/split5.out:org.apache.pig.builtin.PigStorage)
- scope-101
|
|---D: Filter[bag] - scope-97
| |
| Greater Than or Equal[boolean] - scope-100
| |
| |---Project[int][1] - scope-98
| |
| |---Project[int][4] - scope-99
|
|---C: New For Each(true,true)[tuple] - scope-96
| |
| Project[bag][1] - scope-94
| |
| Project[bag][2] - scope-95
|
|---C: Package(Packager)[tuple]{int} - scope-89
|
|---C: Global Rearrange[tuple] - scope-88
|
|---C: Local Rearrange[tuple]{int}(false) - scope-90
| | |
| | Project[int][0] - scope-91
| |
| |---A: New For Each(false,false,false)[bag] - scope-119
| | |
| | Cast[int] - scope-114
| | |
| | |---Project[bytearray][0] - scope-113
| | |
| | Cast[int] - scope-116
| | |
| | |---Project[bytearray][1] - scope-115
| | |
| | Cast[int] - scope-118
| | |
| | |---Project[bytearray][2] - scope-117
| |
| |---A:
Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
- scope-112
|
|---C: Local Rearrange[tuple]{int}(false) - scope-92
| |
| Project[int][0] - scope-93
|
|---B: New For Each(false,false)[bag] - scope-85
| |
| Project[int][0] - scope-81
| |
| Project[int][1] - scope-83
|
|---A: New For Each(false,false,false)[bag] - scope-128
| |
| Cast[int] - scope-123
| |
| |---Project[bytearray][0] - scope-122
| |
| Cast[int] - scope-125
| |
| |---Project[bytearray][1] - scope-124
| |
| Cast[int] - scope-127
| |
| |---Project[bytearray][2] - scope-126
|
|---A:
Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
- scope-121--------
{code}
we just copy subplan(Load:scope-121) of SparkOperator(scope-102) to
SparkOperator(scope-108), connect POLoad(scope-121) with POForEach(scope-119)
and POForEach(scope-128). Here POLoad supports multiOutputs. we can use
OperatorPlan#connect. If a physical operator does not support multiOutputs, we
have to use OperatorPlan#forceConnect to do the connection. Now we can copy
the subplan and generate a new subplan and merge the subplan to the final plan
to do multiquery optimization like following to solve this problem.
{code}
scope-108
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-108
D:
Store(hdfs://zly1.sh.intel.com:8020/user/root/split5.out:org.apache.pig.builtin.PigStorage)
- scope-101
|
|---D: Filter[bag] - scope-97
| |
| Greater Than or Equal[boolean] - scope-100
| |
| |---Project[int][1] - scope-98
| |
| |---Project[int][4] - scope-99
|
|---C: New For Each(true,true)[tuple] - scope-96
| |
| Project[bag][1] - scope-94
| |
| Project[bag][2] - scope-95
|
|---C: Package(Packager)[tuple]{int} - scope-89
|
|---C: Global Rearrange[tuple] - scope-88
|
|---C: Local Rearrange[tuple]{int}(false) - scope-90
| | |
| | Project[int][0] - scope-91
| |
| |---A: New For Each(false,false,false)[bag] - scope-119
| | |
| | Cast[int] - scope-114
| | |
| | |---Project[bytearray][0] - scope-113
| | |
| | Cast[int] - scope-116
| | |
| | |---Project[bytearray][1] - scope-115
| | |
| | Cast[int] - scope-118
| | |
| | |---Project[bytearray][2] - scope-117
| |
| |---A:
Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
- scope-112
|
|---C: Local Rearrange[tuple]{int}(false) - scope-92
| |
| Project[int][0] - scope-93
|
|---B: New For Each(false,false)[bag] - scope-85
| |
| Project[int][0] - scope-81
| |
| Project[int][1] - scope-83
|
|---A: New For Each(false,false,false)[bag] - scope-128
| |
| Cast[int] - scope-123
| |
| |---Project[bytearray][0] - scope-122
| |
| Cast[int] - scope-125
| |
| |---Project[bytearray][1] - scope-124
| |
| Cast[int] - scope-127
| |
| |---Project[bytearray][2] - scope-126
|
|---A:
Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
- scope-132--------
{code}
was:
In current code base, we use OperatorPlan#forceConnect() while merge the
physical plan of spliter and splittee in MultiQueryOptimizationSpark.
The difference between OperatorPlan#connect and OperatorPlan#forceConnect is
not checking whether support multiOutputs and multiInputs or not in
forceConnect.
{code}
/**
* connect from and to and ignore some judgements: like ignoring judge
whether from operator supports multiOutputs
* and whether to operator supports multiInputs
*
* @param from Operator data will flow from.
* @param to Operator data will flow to.
* @throws PlanException if connect from or to which is not in the plan
*/
public void forceConnect(E from, E to) throws PlanException {
markDirty();
// Check that both nodes are in the plan.
checkInPlan(from);
checkInPlan(to);
mFromEdges.put(from, to);
mToEdges.put(to, from);
}
{code}
Let's use an example to explain why add forceConnect before.
{code}
before multiquery optimization:
scope-102->scope-108
scope-108
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-102
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp1293488480/tmp128312889:org.apache.pig.impl.io.InterStorage)
- scope-103
|
|---A: New For Each(false,false,false)[bag] - scope-75
| |
| Cast[int] - scope-67
| |
| |---Project[bytearray][0] - scope-66
| |
| Cast[int] - scope-70
| |
| |---Project[bytearray][1] - scope-69
| |
| Cast[int] - scope-73
| |
| |---Project[bytearray][2] - scope-72
|
|---A:
Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
- scope-65--------
Spark node scope-108
D:
Store(hdfs://zly1.sh.intel.com:8020/user/root/split5.out:org.apache.pig.builtin.PigStorage)
- scope-101
|
|---D: Filter[bag] - scope-97
| |
| Greater Than or Equal[boolean] - scope-100
| |
| |---Project[int][1] - scope-98
| |
| |---Project[int][4] - scope-99
|
|---C: New For Each(true,true)[tuple] - scope-96
| |
| Project[bag][1] - scope-94
| |
| Project[bag][2] - scope-95
|
|---C: Package(Packager)[tuple]{int} - scope-89
|
|---C: Global Rearrange[tuple] - scope-88
|
|---C: Local Rearrange[tuple]{int}(false) - scope-90
| | |
| | Project[int][0] - scope-91
| |
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp1293488480/tmp128312889:org.apache.pig.impl.io.InterStorage)
- scope-104
|
|---C: Local Rearrange[tuple]{int}(false) - scope-92
| |
| Project[int][0] - scope-93
|
|---B: New For Each(false,false)[bag] - scope-85
| |
| Project[int][0] - scope-81
| |
| Project[int][1] - scope-83
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp1293488480/tmp128312889:org.apache.pig.impl.io.InterStorage)
- scope-106--------
after multiquery optimization:
scope-108
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-108
D:
Store(hdfs://zly1.sh.intel.com:8020/user/root/split5.out:org.apache.pig.builtin.PigStorage)
- scope-101
|
|---D: Filter[bag] - scope-97
| |
| Greater Than or Equal[boolean] - scope-100
| |
| |---Project[int][1] - scope-98
| |
| |---Project[int][4] - scope-99
|
|---C: New For Each(true,true)[tuple] - scope-96
| |
| Project[bag][1] - scope-94
| |
| Project[bag][2] - scope-95
|
|---C: Package(Packager)[tuple]{int} - scope-89
|
|---C: Global Rearrange[tuple] - scope-88
|
|---C: Local Rearrange[tuple]{int}(false) - scope-90
| | |
| | Project[int][0] - scope-91
| |
| |---A: New For Each(false,false,false)[bag] - scope-119
| | |
| | Cast[int] - scope-114
| | |
| | |---Project[bytearray][0] - scope-113
| | |
| | Cast[int] - scope-116
| | |
| | |---Project[bytearray][1] - scope-115
| | |
| | Cast[int] - scope-118
| | |
| | |---Project[bytearray][2] - scope-117
| |
| |---A:
Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
- scope-112
|
|---C: Local Rearrange[tuple]{int}(false) - scope-92
| |
| Project[int][0] - scope-93
|
|---B: New For Each(false,false)[bag] - scope-85
| |
| Project[int][0] - scope-81
| |
| Project[int][1] - scope-83
|
|---A: New For Each(false,false,false)[bag] - scope-128
| |
| Cast[int] - scope-123
| |
| |---Project[bytearray][0] - scope-122
| |
| Cast[int] - scope-125
| |
| |---Project[bytearray][1] - scope-124
| |
| Cast[int] - scope-127
| |
| |---Project[bytearray][2] - scope-126
|
|---A:
Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
- scope-121--------
{code}
we just copy subplan(Load:scope-121) of SparkOperator(scope-102
> Not use OperatorPlan#forceConnect in MultiQueryOptimizationSpark
> -----------------------------------------------------------------
>
> Key: PIG-4871
> URL: https://issues.apache.org/jira/browse/PIG-4871
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
>
> In current code base, we use OperatorPlan#forceConnect() while merge the
> physical plan of spliter and splittee in MultiQueryOptimizationSpark.
> The difference between OperatorPlan#connect and OperatorPlan#forceConnect is
> not checking whether support multiOutputs and multiInputs or not in
> forceConnect.
> {code}
> /**
> * connect from and to and ignore some judgements: like ignoring judge
> whether from operator supports multiOutputs
> * and whether to operator supports multiInputs
> *
> * @param from Operator data will flow from.
> * @param to Operator data will flow to.
> * @throws PlanException if connect from or to which is not in the plan
> */
> public void forceConnect(E from, E to) throws PlanException {
> markDirty();
> // Check that both nodes are in the plan.
> checkInPlan(from);
> checkInPlan(to);
> mFromEdges.put(from, to);
> mToEdges.put(to, from);
> }
> {code}
> Let's use an example to explain why add forceConnect before.
> {code}
> before multiquery optimization:
> scope-102->scope-108
> scope-108
> #--------------------------------------------------
> # Spark Plan
> #--------------------------------------------------
> Spark node scope-102
> Store(hdfs://zly1.sh.intel.com:8020/tmp/temp1293488480/tmp128312889:org.apache.pig.impl.io.InterStorage)
> - scope-103
> |
> |---A: New For Each(false,false,false)[bag] - scope-75
> | |
> | Cast[int] - scope-67
> | |
> | |---Project[bytearray][0] - scope-66
> | |
> | Cast[int] - scope-70
> | |
> | |---Project[bytearray][1] - scope-69
> | |
> | Cast[int] - scope-73
> | |
> | |---Project[bytearray][2] - scope-72
> |
> |---A:
> Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
> - scope-65--------
> Spark node scope-108
> D:
> Store(hdfs://zly1.sh.intel.com:8020/user/root/split5.out:org.apache.pig.builtin.PigStorage)
> - scope-101
> |
> |---D: Filter[bag] - scope-97
> | |
> | Greater Than or Equal[boolean] - scope-100
> | |
> | |---Project[int][1] - scope-98
> | |
> | |---Project[int][4] - scope-99
> |
> |---C: New For Each(true,true)[tuple] - scope-96
> | |
> | Project[bag][1] - scope-94
> | |
> | Project[bag][2] - scope-95
> |
> |---C: Package(Packager)[tuple]{int} - scope-89
> |
> |---C: Global Rearrange[tuple] - scope-88
> |
> |---C: Local Rearrange[tuple]{int}(false) - scope-90
> | | |
> | | Project[int][0] - scope-91
> | |
> |
> |---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp1293488480/tmp128312889:org.apache.pig.impl.io.InterStorage)
> - scope-104
> |
> |---C: Local Rearrange[tuple]{int}(false) - scope-92
> | |
> | Project[int][0] - scope-93
> |
> |---B: New For Each(false,false)[bag] - scope-85
> | |
> | Project[int][0] - scope-81
> | |
> | Project[int][1] - scope-83
> |
>
> |---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp1293488480/tmp128312889:org.apache.pig.impl.io.InterStorage)
> - scope-106--------
> after multiquery optimization:
> scope-108
> #--------------------------------------------------
> # Spark Plan
> #--------------------------------------------------
> Spark node scope-108
> D:
> Store(hdfs://zly1.sh.intel.com:8020/user/root/split5.out:org.apache.pig.builtin.PigStorage)
> - scope-101
> |
> |---D: Filter[bag] - scope-97
> | |
> | Greater Than or Equal[boolean] - scope-100
> | |
> | |---Project[int][1] - scope-98
> | |
> | |---Project[int][4] - scope-99
> |
> |---C: New For Each(true,true)[tuple] - scope-96
> | |
> | Project[bag][1] - scope-94
> | |
> | Project[bag][2] - scope-95
> |
> |---C: Package(Packager)[tuple]{int} - scope-89
> |
> |---C: Global Rearrange[tuple] - scope-88
> |
> |---C: Local Rearrange[tuple]{int}(false) - scope-90
> | | |
> | | Project[int][0] - scope-91
> | |
> | |---A: New For Each(false,false,false)[bag] - scope-119
> | | |
> | | Cast[int] - scope-114
> | | |
> | | |---Project[bytearray][0] - scope-113
> | | |
> | | Cast[int] - scope-116
> | | |
> | | |---Project[bytearray][1] - scope-115
> | | |
> | | Cast[int] - scope-118
> | | |
> | | |---Project[bytearray][2] - scope-117
> | |
> | |---A:
> Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
> - scope-112
> |
> |---C: Local Rearrange[tuple]{int}(false) - scope-92
> | |
> | Project[int][0] - scope-93
> |
> |---B: New For Each(false,false)[bag] - scope-85
> | |
> | Project[int][0] - scope-81
> | |
> | Project[int][1] - scope-83
> |
> |---A: New For Each(false,false,false)[bag] -
> scope-128
> | |
> | Cast[int] - scope-123
> | |
> | |---Project[bytearray][0] - scope-122
> | |
> | Cast[int] - scope-125
> | |
> | |---Project[bytearray][1] - scope-124
> | |
> | Cast[int] - scope-127
> | |
> | |---Project[bytearray][2] - scope-126
> |
> |---A:
> Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
> - scope-121--------
> {code}
> we just copy subplan(Load:scope-121) of SparkOperator(scope-102) to
> SparkOperator(scope-108), connect POLoad(scope-121) with POForEach(scope-119)
> and POForEach(scope-128). Here POLoad supports multiOutputs. we can use
> OperatorPlan#connect. If a physical operator does not support multiOutputs,
> we have to use OperatorPlan#forceConnect to do the connection. Now we can
> copy the subplan and generate a new subplan and merge the subplan to the
> final plan to do multiquery optimization like following to solve this problem.
> {code}
> scope-108
> #--------------------------------------------------
> # Spark Plan
> #--------------------------------------------------
> Spark node scope-108
> D:
> Store(hdfs://zly1.sh.intel.com:8020/user/root/split5.out:org.apache.pig.builtin.PigStorage)
> - scope-101
> |
> |---D: Filter[bag] - scope-97
> | |
> | Greater Than or Equal[boolean] - scope-100
> | |
> | |---Project[int][1] - scope-98
> | |
> | |---Project[int][4] - scope-99
> |
> |---C: New For Each(true,true)[tuple] - scope-96
> | |
> | Project[bag][1] - scope-94
> | |
> | Project[bag][2] - scope-95
> |
> |---C: Package(Packager)[tuple]{int} - scope-89
> |
> |---C: Global Rearrange[tuple] - scope-88
> |
> |---C: Local Rearrange[tuple]{int}(false) - scope-90
> | | |
> | | Project[int][0] - scope-91
> | |
> | |---A: New For Each(false,false,false)[bag] - scope-119
> | | |
> | | Cast[int] - scope-114
> | | |
> | | |---Project[bytearray][0] - scope-113
> | | |
> | | Cast[int] - scope-116
> | | |
> | | |---Project[bytearray][1] - scope-115
> | | |
> | | Cast[int] - scope-118
> | | |
> | | |---Project[bytearray][2] - scope-117
> | |
> | |---A:
> Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
> - scope-112
> |
> |---C: Local Rearrange[tuple]{int}(false) - scope-92
> | |
> | Project[int][0] - scope-93
> |
> |---B: New For Each(false,false)[bag] - scope-85
> | |
> | Project[int][0] - scope-81
> | |
> | Project[int][1] - scope-83
> |
> |---A: New For Each(false,false,false)[bag] -
> scope-128
> | |
> | Cast[int] - scope-123
> | |
> | |---Project[bytearray][0] - scope-122
> | |
> | Cast[int] - scope-125
> | |
> | |---Project[bytearray][1] - scope-124
> | |
> | Cast[int] - scope-127
> | |
> | |---Project[bytearray][2] - scope-126
> |
> |---A:
> Load(hdfs://zly1.sh.intel.com:8020/user/root/split5:org.apache.pig.builtin.PigStorage)
> - scope-132--------
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)