Github user viirya commented on the issue:
https://github.com/apache/spark/pull/21952
Ah, finally I can reproduce this. It requires allocating the array feature
with length 16000. When I reduced it to 1600, the regression was largely
relieved. `com.databricks.spark.avro` is faster only on Spark 2.3. When used
with the current master branch, it isn't faster than the built-in avro
datasource. Something in between may be causing this regression.
```scala
> "com.databricks.spark.avro - Spark 2.3"
scala> spark.sparkContext.parallelize(writeTimes.slice(50,
150)).toDF("writeTimes").describe("writeTimes").show()
+-------+-------------------+
|summary| writeTimes|
+-------+-------------------+
| count| 100|
| mean| 0.9711099999999999|
| stddev|0.01940836797556013|
| min| 0.941|
| max| 1.037|
+-------+-------------------+
scala> spark.sparkContext.parallelize(readTimes.slice(50,
150)).toDF("readTimes").describe("readTimes").show()
+-------+-------------------+
|summary| readTimes|
+-------+-------------------+
| count| 100|
| mean| 0.36022|
| stddev|0.05807476546520342|
| min| 0.287|
| max| 0.626|
+-------+-------------------+
> "avro"
scala> spark.sparkContext.parallelize(writeTimes.slice(50,
150)).toDF("writeTimes").describe("writeTimes").show()
+-------+-------------------+
|summary| writeTimes|
+-------+-------------------+
| count| 100|
| mean| 1.7371699999999999|
| stddev|0.03504399976018602|
| min| 1.695|
| max| 1.886|
+-------+-------------------+
scala> spark.sparkContext.parallelize(readTimes.slice(50,
150)).toDF("readTimes").describe("readTimes").show()
+-------+-------------------+
|summary| readTimes|
+-------+-------------------+
| count| 100|
| mean|0.32348999999999994|
| stddev|0.06235617714615632|
| min| 0.263|
| max| 0.781|
+-------+-------------------+
```
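For reference, a minimal pure-Scala sketch of the measurement pattern implied by the `writeTimes.slice(50, 150)` calls above: run 150 trials, discard the first 50 as warm-up, then report the sample mean and stddev (the same statistics `describe` prints). The workload below is a placeholder loop, not the actual Spark Avro write/read; the helper names are hypothetical.

```scala
object BenchSketch {
  // Time one run of `body` in seconds.
  def timeIt[A](body: => A): Double = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e9
  }

  // Sample mean and sample standard deviation (n - 1 denominator),
  // matching what DataFrame.describe reports.
  def summarize(times: Seq[Double]): (Double, Double) = {
    val n = times.length
    val mean = times.sum / n
    val variance = times.map(t => math.pow(t - mean, 2)).sum / (n - 1)
    (mean, math.sqrt(variance))
  }

  def main(args: Array[String]): Unit = {
    val writeTimes = (1 to 150).map { _ =>
      timeIt {
        // Placeholder workload; in the real benchmark this would be
        // something like df.write.format("avro").save(path).
        (1 to 16000).toArray.sum
      }
    }
    // Drop the first 50 runs as JIT/IO warm-up, keep the next 100.
    val (mean, stddev) = summarize(writeTimes.slice(50, 150))
    println(f"count=100 mean=$mean%.6f stddev=$stddev%.6f")
  }
}
```

Dropping the warm-up slice matters here: JIT compilation and filesystem caching make the first runs unrepresentative, which is presumably why the numbers above summarize only runs 50 through 150.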