[jira] [Updated] (SPARK-30768) Constraints should be inferred from inequality attributes

Yuming Wang (Jira) Sun, 09 Feb 2020 21:54:58 -0800


     [ https://issues.apache.org/jira/browse/SPARK-30768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yuming Wang updated SPARK-30768:
--------------------------------
    Description: 
How to reproduce:
{code:sql}
create table SPARK_30768_1(c1 int, c2 int);
create table SPARK_30768_2(c1 int, c2 int);
{code}


*Spark SQL*:
{noformat}
spark-sql> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on (t1.c1 > t2.c1) where t1.c1 = 3;
== Physical Plan ==
*(3) Project [c1#5, c2#6]
+- BroadcastNestedLoopJoin BuildRight, Inner, (c1#5 > c1#7)
   :- *(1) Project [c1#5, c2#6]
   :  +- *(1) Filter (isnotnull(c1#5) AND (c1#5 = 3))
   :     +- *(1) ColumnarToRow
   :        +- FileScan parquet default.spark_30768_1[c1#5,c2#6] Batched: true, DataFilters: [isnotnull(c1#5), (c1#5 = 3)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,3)], ReadSchema: struct<c1:int,c2:int>
   +- BroadcastExchange IdentityBroadcastMode, [id=#60]
      +- *(2) Project [c1#7]
         +- *(2) Filter isnotnull(c1#7)
            +- *(2) ColumnarToRow
               +- FileScan parquet default.spark_30768_2[c1#7] Batched: true, DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: struct<c1:int>
{noformat}

*Hive* supports this feature (note the inferred {{c1 < 3}} filter on t2):
{noformat}
hive> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on (t1.c1 > t2.c1) where t1.c1 = 3;
Warning: Map Join MAPJOIN[13][bigTable=?] in task 'Stage-3:MAPRED' is a cross product
OK
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_0:t1
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_0:t1
          TableScan
            alias: t1
            filterExpr: (c1 = 3) (type: boolean)
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: (c1 = 3) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: c2 (type: int)
                outputColumnNames: _col1
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0
                    1

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t2
            filterExpr: (c1 < 3) (type: boolean)
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: (c1 < 3) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0
                    1
                  outputColumnNames: _col1
                  Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL Column stats: NONE
                  Select Operator
                    expressions: 3 (type: int), _col1 (type: int)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 5.491 seconds, Fetched: 71 row(s)
{noformat}
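
The difference between the two plans is that Hive derives {{t2.c1 < 3}} from the join condition {{t1.c1 > t2.c1}} and the filter {{t1.c1 = 3}}, while the Spark plan only filters t2 with {{isnotnull(c1)}}. The rule can be sketched roughly as below. This is a minimal illustration, not Spark's or Hive's actual implementation: the {{Cmp}} class and the {{infer_from_inequalities}} function are hypothetical names, and the sketch assumes all predicates are conjuncts of an inner join, so a derived predicate never changes the result.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Operand = Union[str, int]  # a column name (str) or an integer literal


@dataclass(frozen=True)
class Cmp:
    """A binary comparison `left op right` (hypothetical helper type)."""
    left: Operand
    op: str
    right: Operand


# Mirror image of each operator: `a > b` is the same as `b < a`.
FLIP = {">": "<", "<": ">", ">=": "<=", "<=": ">="}


def infer_from_inequalities(predicates: List[Cmp]) -> List[Cmp]:
    """Derive `column op literal` from `x = literal` plus an inequality on x."""
    # Step 1: collect columns pinned to a literal by an equality predicate.
    constants: Dict[str, int] = {}
    for p in predicates:
        if p.op != "=":
            continue
        if isinstance(p.left, str) and isinstance(p.right, int):
            constants[p.left] = p.right
        elif isinstance(p.left, int) and isinstance(p.right, str):
            constants[p.right] = p.left

    # Step 2: substitute the literal into column-to-column inequalities.
    inferred: List[Cmp] = []
    for p in predicates:
        if p.op not in FLIP:
            continue
        if not (isinstance(p.left, str) and isinstance(p.right, str)):
            continue
        if p.left in constants:
            # `a op b` with `a = k` gives `k op b`, written as `b FLIP[op] k`.
            candidate = Cmp(p.right, FLIP[p.op], constants[p.left])
        elif p.right in constants:
            # `a op b` with `b = k` gives `a op k`.
            candidate = Cmp(p.left, p.op, constants[p.right])
        else:
            continue
        if candidate not in predicates and candidate not in inferred:
            inferred.append(candidate)
    return inferred


# The predicates from the query in this issue.
query_predicates = [Cmp("t1.c1", ">", "t2.c1"), Cmp("t1.c1", "=", 3)]
print(infer_from_inequalities(query_predicates))
# prints [Cmp(left='t2.c1', op='<', right=3)]
```

The derived {{t2.c1 < 3}} is the predicate that appears as {{filterExpr: (c1 < 3)}} on t2 in the Hive plan above.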


> Constraints should be inferred from inequality attributes
> ---------------------------------------------------------
>
>                 Key: SPARK-30768
>                 URL: https://issues.apache.org/jira/browse/SPARK-30768
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
