[ https://issues.apache.org/jira/browse/PARQUET-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuming Wang updated PARQUET-1355:
---------------------------------
    Description:
*Benchmark code*:
{code:java}
test("Parquet write benchmark") {
  val count = 100 * 1024 * 1024
  val numIters = 5
  withTempPath { path =>
    val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)
    Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)").foreach { dt =>
      benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
        spark.range(count).selectExpr(s"cast(id as $dt) as id")
          .write.mode("overwrite").parquet(path.getAbsolutePath)
      }
    }
    benchmark.run()
  }
}
{code}
*Result*:
{noformat}
-- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                    10963 / 11344         0.0   2192675973.8       1.0X
string type                                  28423 / 29437         0.0   5684553922.2       0.4X
decimal(18, 0) type                          11558 / 11696         0.0   2311587203.6       0.9X
decimal(38, 18) type                         43858 / 44432         0.0   8771537663.4       0.2X

-- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                    11633 / 12070         0.0   2326572295.8       1.0X
string type                                  31374 / 32178         0.0   6274760187.4       0.4X
decimal(18, 0) type                          13019 / 13294         0.0   2603841925.4       0.9X
decimal(38, 18) type                         50719 / 50983         0.0  10143775007.6       0.2X
{noformat}
The slowdown is mainly caused by [toByteBuffer|https://github.com/apache/parquet-mr/blob/d61d221c9e752ce2cc0da65ede8b55653b3ae21f/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L83].
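For illustration only, here is a minimal sketch of the kind of change being measured: comparing two binary values directly on their backing byte arrays instead of first wrapping each one in a {{ByteBuffer}}. This is not the parquet-mr code. The class {{BinaryCompareSketch}} and both method names are hypothetical, and unsigned lexicographic ordering is assumed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of parquet-mr.
final class BinaryCompareSketch {

    // Unsigned lexicographic comparison directly on the backing arrays.
    // No wrapper objects are allocated per comparison.
    static int compareUnsigned(byte[] a, int aOff, int aLen,
                               byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xFF; // mask to treat the byte as unsigned
            int y = b[bOff + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal: the shorter value sorts first.
        return aLen - bLen;
    }

    // Same ordering, but each call first creates two ByteBuffer views.
    // This stands in for the assumed slower path.
    static int compareViaByteBuffer(byte[] a, byte[] b) {
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        int n = Math.min(x.remaining(), y.remaining());
        for (int i = 0; i < n; i++) {
            int cmp = (x.get(i) & 0xFF) - (y.get(i) & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return x.remaining() - y.remaining();
    }

    public static void main(String[] args) {
        byte[] a = "apple".getBytes(StandardCharsets.UTF_8);
        byte[] b = "apricot".getBytes(StandardCharsets.UTF_8);
        // Both paths should agree on the sign of the result.
        System.out.println(Integer.signum(compareUnsigned(a, 0, a.length, b, 0, b.length)));
        System.out.println(Integer.signum(compareViaByteBuffer(a, b)));
    }
}
```

Both methods produce the same ordering; the first simply avoids creating two wrapper objects per comparison. On Java 9 and later, {{java.util.Arrays.compareUnsigned}} offers a ranged array comparison of this kind, but the benchmark above ran on Java 8.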
If {{toByteBuffer}} is not used when comparing binary values, the result is:
{noformat}
-- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                    11171 / 11508         0.0   2234189382.0       1.0X
string type                                  30072 / 30290         0.0   6014346455.4       0.4X
decimal(18, 0) type                          12150 / 12239         0.0   2430052708.8       0.9X
decimal(38, 18) type                         44974 / 45423         0.0   8994773738.8       0.2X
{noformat}

  was:
*Benchmark code*:
{code:java}
test("Parquet write benchmark") {
  val count = 100 * 1024 * 1024
  val numIters = 5
  withTempPath { path =>
    val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)
    Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)", "timestamp").foreach { dt =>
      benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
        spark.range(count).selectExpr(s"cast(id as $dt) as id")
          .write.mode("overwrite").parquet(path.getAbsolutePath)
      }
    }
    benchmark.run()
  }
}
{code}
*Result*:
{noformat}
-- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                    10963 / 11344         0.0   2192675973.8       1.0X
string type                                  28423 / 29437         0.0   5684553922.2       0.4X
decimal(18, 0) type                          11558 / 11696         0.0   2311587203.6       0.9X
decimal(38, 18) type                         43858 / 44432         0.0   8771537663.4       0.2X

-- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                    11633 / 12070         0.0   2326572295.8       1.0X
string type                                  31374 / 32178         0.0   6274760187.4       0.4X
decimal(18, 0) type                          13019 / 13294         0.0   2603841925.4       0.9X
decimal(38, 18) type                         50719 / 50983         0.0  10143775007.6       0.2X
{noformat}
The slowdown is mainly caused by [toByteBuffer|https://github.com/apache/parquet-mr/blob/d61d221c9e752ce2cc0da65ede8b55653b3ae21f/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L83].
If {{toByteBuffer}} is not used when comparing binary values, the result is:
{noformat}
-- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                    11171 / 11508         0.0   2234189382.0       1.0X
string type                                  30072 / 30290         0.0   6014346455.4       0.4X
decimal(18, 0) type                          12150 / 12239         0.0   2430052708.8       0.9X
decimal(38, 18) type                         44974 / 45423         0.0   8994773738.8       0.2X
{noformat}


> Improvement parquet Binary write performance
> --------------------------------------------
>
>                 Key: PARQUET-1355
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1355
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major