GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/14660
[SPARK-17071][SQL] Fetch Parquet schema without another Spark job when there is only a single file to touch

## What changes were proposed in this pull request?

Spark always runs a separate job to figure out the schema ([ParquetFileFormat#L739-L778](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L739-L778)). However, this is needless overhead when only a single file has to be touched. I ran a benchmark with the code below:

```scala
test("Benchmark for Parquet schema reading") {
  withTempPath { path =>
    Seq((1, 2D, 3L, "4")).toDF("a", "b", "c", "d")
      .write.format("parquet").save(path.getAbsolutePath)

    val benchmark = new Benchmark("Parquet - read schema", 1)
    benchmark.addCase("Parquet - read schema", 10) { _ =>
      spark.read.format("parquet").load(path.getCanonicalPath).schema
    }
    benchmark.run()
  }
}
```

with the results as below:

- **Before**

  ```
  Parquet - read schema:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------
  Parquet - read schema                           47 /   49          0.0    46728419.0       1.0X
  ```

- **After**

  ```
  Parquet - read schema:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------
  Parquet - read schema                            2 /    3          0.0     1811673.0       1.0X
  ```

So this is roughly 20X faster (although schema inference is only a small fraction of total job run-time). For reference, ORC already does this on the driver side ([OrcFileOperator.scala#L74-L83](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L74-L83)).

## How was this patch tested?
Existing tests should cover this.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17071

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14660.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14660

----

commit 614abbc6b7a03ff0d3e505697c0bbfec3b330c2b
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-08-16T05:42:29Z

    Fetch Parquet schema within driver-side when there is single file to touch without another Spark job

commit e1214d50035441fb96551683cf38ae3e49f07b7d
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-08-16T05:46:12Z

    Fix modifier

----
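As a side note, the dispatch the PR describes can be sketched in a few lines: when exactly one Parquet file must be touched, read its footer directly on the driver; otherwise fall back to a distributed schema-merging job. This is a minimal illustration only, not Spark's actual code; the names `readFooterLocally` and `runSchemaJob` are hypothetical placeholders for the driver-side footer read and the Spark job respectively.

```scala
// Minimal sketch of the single-file fast path described in the PR.
// `readFooterLocally` and `runSchemaJob` are illustrative stand-ins,
// not real Spark APIs.
object SchemaFetchSketch {
  def fetchSchema(files: Seq[String],
                  readFooterLocally: String => String,
                  runSchemaJob: Seq[String] => String): String =
    files match {
      case Seq(single) => readFooterLocally(single) // driver-side, no Spark job
      case many        => runSchemaJob(many)        // distributed schema merge
    }
}
```

The benefit comes entirely from the single-file branch: it avoids scheduling a job (and its task-launch latency) just to read one footer, which is what the benchmark above measures.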