wangyum commented on pull request #29642:
URL: https://github.com/apache/spark/pull/29642#issuecomment-730869008
It seems that only Parquet does not support `In` predicate pushdown well. @MaxGekk,
what do you think?
This is the benchmark of CSV:
```scala
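// Assumes the enclosing object is a SqlBasedBenchmark such as CSVBenchmark, which
// supplies spark, output, withTempPath, withSQLConf and noop(), and that
// spark.implicits._ is imported for the $"id" syntax.
import java.time.Instant

import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._

val rowsNum = 100 * 1000
val numIters = 3
val colsNum = 100
val fields = Seq.tabulate(colsNum)(i => StructField(s"col$i", TimestampType))
val schema = StructType(StructField("key", IntegerType) +: fields)

// One integer key column (id % 1000, so keys fall in [0, 1000)) plus 100 timestamp columns.
def columns(): Seq[Column] = {
  val ts = Seq.tabulate(colsNum) { i =>
    lit(Instant.ofEpochSecond(i * 12345678)).as(s"col$i")
  }
  ($"id" % 1000).as("key") +: ts
}

withTempPath { path =>
  // Write the test data once as CSV with a header row.
  spark.range(rowsNum).select(columns(): _*)
    .write.option("header", true)
    .csv(path.getAbsolutePath)

  def readback = {
    spark.read
      .option("header", true)
      .schema(schema)
      .csv(path.getAbsolutePath)
  }

  // Read the data back with the given filter, with CSV filter pushdown on or off.
  def withFilter(filter: String, configEnabled: Boolean): Unit = {
    withSQLConf(SQLConf.CSV_FILTER_PUSHDOWN_ENABLED.key -> configEnabled.toString()) {
      readback.filter(filter).noop()
    }
  }

  Seq(5, 10, 50, 100, 500).foreach { count =>
    Seq(10, 50).foreach { distribution =>
      val title = s"InSet -> InFilters (values count: $count, distribution: $distribution)"
      val benchmark = new Benchmark(title, rowsNum, output = output)
      Seq(false, true).foreach { pushDownEnabled =>
        val name = s"Native CSV Vectorized ${if (pushDownEnabled) "(Pushdown)" else ""}"
        benchmark.addCase(name, numIters) { _ =>
          // Random IN-list values drawn from [0, rowsNum * distribution / 100).
          // Keys are below 1000, so most of these values match no row.
          val filter =
            Range(0, count).map(_ => scala.util.Random.nextInt(rowsNum * distribution / 100))
          val whereExpr = s"key in(${filter.mkString(",")})"
          withFilter(whereExpr, configEnabled = pushDownEnabled)
        }
      }
      benchmark.run()
    }
  }
}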
```
Result:
```
================================================================================================
Benchmark to measure CSV read performance
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 5, distribution: 10):   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             13082          17077        1674         0.0      130815.6       1.0X
Native CSV Vectorized (Pushdown)                                   1172           1192          35         0.1       11719.5      11.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 5, distribution: 50):   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11858          12028         237         0.0      118576.9       1.0X
Native CSV Vectorized (Pushdown)                                   1165           1172           6         0.1       11652.4      10.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11883          12180         494         0.0      118834.3       1.0X
Native CSV Vectorized (Pushdown)                                   1142           1156          21         0.1       11418.6      10.4X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11857          11878          19         0.0      118570.4       1.0X
Native CSV Vectorized (Pushdown)                                   1169           1174           7         0.1       11692.9      10.1X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 50, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11923          11962          66         0.0      119228.0       1.0X
Native CSV Vectorized (Pushdown)                                   1196           1225          26         0.1       11960.7      10.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 50, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11910          11917           7         0.0      119095.3       1.0X
Native CSV Vectorized (Pushdown)                                   1191           1194           5         0.1       11908.0      10.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 10): Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11948          12097         201         0.0      119484.5       1.0X
Native CSV Vectorized (Pushdown)                                   1250           1284          32         0.1       12501.4       9.6X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 50): Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11938          11978          39         0.0      119378.8       1.0X
Native CSV Vectorized (Pushdown)                                   1176           1188          11         0.1       11756.0      10.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 500, distribution: 10): Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11954          12051         124         0.0      119542.9       1.0X
Native CSV Vectorized (Pushdown)                                   1762           1833         104         0.1       17620.6       6.8X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 500, distribution: 50): Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11860          12166         484         0.0      118597.8       1.0X
Native CSV Vectorized (Pushdown)                                   1417           1434          15         0.1       14171.7       8.4X
```
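For reading the tables, here is a minimal sketch of how the `Relative` and `Per Row(ns)` columns appear to follow from `Best Time(ms)`. The object and method names are hypothetical, and both formulas are assumptions inferred from the numbers above, not taken from the `Benchmark` source. The per-row figure only matches approximately, because the table rounds the best time to whole milliseconds.

```scala
// Hypothetical helper for sanity-checking the result tables; not part of Spark.
object RelativeCheck {
  // Assumption: "Relative" is the baseline best time divided by the case best time.
  def relative(baselineBestMs: Double, caseBestMs: Double): Double =
    baselineBestMs / caseBestMs

  // Assumption: "Per Row(ns)" is the best time divided by the number of rows.
  def perRowNs(bestMs: Double, rows: Long): Double =
    bestMs * 1e6 / rows

  def main(args: Array[String]): Unit = {
    val rowsNum = 100 * 1000L
    // Best times (ms) from the first table (values count: 5, distribution: 10).
    val baselineBestMs = 13082.0 // Native CSV Vectorized
    val pushdownBestMs = 1172.0  // Native CSV Vectorized (Pushdown)

    println(s"relative speedup: ${relative(baselineBestMs, pushdownBestMs)}")
    println(s"per-row time (ns): ${perRowNs(pushdownBestMs, rowsNum)}")
  }
}
```

With the first table's best times, 13082 / 1172 is about 11.2, which agrees with the reported 11.2X.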