wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-730869008


   It seems that only Parquet lacks good support for `In` predicate pushdown. @MaxGekk, what do you think?
   
   Here is the same benchmark for CSV:
   ```scala
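   // Assumes this runs inside a SqlBasedBenchmark subclass such as CSVBenchmark,
   // which provides spark, withTempPath, withSQLConf, noop() and output.
   import java.time.Instant

   import org.apache.spark.benchmark.Benchmark
   import org.apache.spark.sql.Column
   import org.apache.spark.sql.functions.lit
   import org.apache.spark.sql.internal.SQLConf
   import org.apache.spark.sql.types._
   import spark.implicits._

   val rowsNum = 100 * 1000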
   val numIters = 3
   val colsNum = 100
   val fields = Seq.tabulate(colsNum)(i => StructField(s"col$i", TimestampType))
   val schema = StructType(StructField("key", IntegerType) +: fields)
   def columns(): Seq[Column] = {
     val ts = Seq.tabulate(colsNum) { i =>
       lit(Instant.ofEpochSecond(i * 12345678)).as(s"col$i")
     }
     ($"id" % 1000).as("key") +: ts
   }
   withTempPath { path =>
     spark.range(rowsNum).select(columns(): _*)
       .write.option("header", true)
       .csv(path.getAbsolutePath)
     def readback = {
       spark.read
         .option("header", true)
         .schema(schema)
         .csv(path.getAbsolutePath)
     }
   
     def withFilter(filter: String, configEnabled: Boolean): Unit = {
       withSQLConf(SQLConf.CSV_FILTER_PUSHDOWN_ENABLED.key -> configEnabled.toString) {
         readback.filter(filter).noop()
       }
     }
   
     Seq(5, 10, 50, 100, 500).foreach { count =>
       Seq(10, 50).foreach { distribution =>
         val title = s"InSet -> InFilters (values count: $count, distribution: $distribution)"
         val benchmark = new Benchmark(title, rowsNum, output = output)
         Seq(false, true).foreach { pushDownEnabled =>
           val name = s"Native CSV Vectorized ${if (pushDownEnabled) "(Pushdown)" else ""}"
           benchmark.addCase(name, numIters) { _ =>
             val filter =
               Range(0, count).map(_ => scala.util.Random.nextInt(rowsNum * distribution / 100))
             val whereExpr = s"key in(${filter.mkString(",")})"
             withFilter(whereExpr, configEnabled = pushDownEnabled)
           }
         }
         benchmark.run()
       }
     }
   }
   ```
   
   Result:
   ```
   
   ================================================================================================
   Benchmark to measure CSV read performance
   ================================================================================================
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 5, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                           13082          17077        1674          0.0      130815.6       1.0X
   Native CSV Vectorized (Pushdown)                                 1172           1192          35          0.1       11719.5      11.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 5, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                           11858          12028         237          0.0      118576.9       1.0X
   Native CSV Vectorized (Pushdown)                                 1165           1172           6          0.1       11652.4      10.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 10, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11883          12180         494          0.0      118834.3       1.0X
   Native CSV Vectorized (Pushdown)                                  1142           1156          21          0.1       11418.6      10.4X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 10, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11857          11878          19          0.0      118570.4       1.0X
   Native CSV Vectorized (Pushdown)                                  1169           1174           7          0.1       11692.9      10.1X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 50, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11923          11962          66          0.0      119228.0       1.0X
   Native CSV Vectorized (Pushdown)                                  1196           1225          26          0.1       11960.7      10.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 50, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                            11910          11917           7          0.0      119095.3       1.0X
   Native CSV Vectorized (Pushdown)                                  1191           1194           5          0.1       11908.0      10.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 100, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11948          12097         201          0.0      119484.5       1.0X
   Native CSV Vectorized (Pushdown)                                   1250           1284          32          0.1       12501.4       9.6X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 100, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11938          11978          39          0.0      119378.8       1.0X
   Native CSV Vectorized (Pushdown)                                   1176           1188          11          0.1       11756.0      10.2X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 500, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11954          12051         124          0.0      119542.9       1.0X
   Native CSV Vectorized (Pushdown)                                   1762           1833         104          0.1       17620.6       6.8X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   InSet -> InFilters (values count: 500, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------
   Native CSV Vectorized                                             11860          12166         484          0.0      118597.8       1.0X
   Native CSV Vectorized (Pushdown)                                   1417           1434          15          0.1       14171.7       8.4X
   ```
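
   As a quick sanity check on the `Relative` column, here is a minimal sketch that recomputes the speedup from the `Best Time(ms)` values above. It assumes `Relative` is the baseline best time divided by the pushdown best time; the object and method names (`RelativeSpeedup`, `speedup`) are made up for illustration and are not part of Spark.

   ```scala
   // Hypothetical helper, not part of Spark: recomputes the Relative column
   // from the Best Time(ms) values reported in the results above.
   object RelativeSpeedup {
     // (values count, distribution) -> (best ms without pushdown, best ms with pushdown)
     val bestTimesMs: Seq[((Int, Int), (Double, Double))] = Seq(
       (5, 10) -> (13082.0, 1172.0),
       (5, 50) -> (11858.0, 1165.0),
       (10, 10) -> (11883.0, 1142.0),
       (10, 50) -> (11857.0, 1169.0),
       (50, 10) -> (11923.0, 1196.0),
       (50, 50) -> (11910.0, 1191.0),
       (100, 10) -> (11948.0, 1250.0),
       (100, 50) -> (11938.0, 1176.0),
       (500, 10) -> (11954.0, 1762.0),
       (500, 50) -> (11860.0, 1417.0)
     )

     // Assumption: Relative = baseline best time / case best time.
     def speedup(baselineMs: Double, caseMs: Double): Double = baselineMs / caseMs

     def main(args: Array[String]): Unit = {
       bestTimesMs.foreach { case ((count, dist), (base, pushed)) =>
         println(f"count=$count%3d distribution=$dist%2d speedup=${speedup(base, pushed)}%.1fX")
       }
     }
   }
   ```

   Under that assumption the ratios round to the reported `Relative` values. The speedup stays near 10X up to 100 values and drops to 6.8X and 8.4X at 500 values.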

