[
https://issues.apache.org/jira/browse/PARQUET-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated PARQUET-1355:
------------------------------------
Labels: pull-request-available (was: )
> Improvement Binary write performance
> ------------------------------------
>
> Key: PARQUET-1355
> URL: https://issues.apache.org/jira/browse/PARQUET-1355
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Labels: pull-request-available
>
> *Benchmark code*:
> {code:java}
> test("Parquet write benchmark") {
>   val count = 100 * 1024 * 1024
>   val numIters = 5
>   withTempPath { path =>
>     val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)
>     Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)").foreach { dt =>
>       benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
>         spark.range(count).selectExpr(s"cast(id as $dt) as id")
>           .write.mode("overwrite").parquet(path.getAbsolutePath)
>       }
>     }
>     benchmark.run()
>   }
> }
> {code}
> *Result*:
> {noformat}
> -- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                    10963 / 11344          0.0  2192675973.8       1.0X
> string type                                  28423 / 29437          0.0  5684553922.2       0.4X
> decimal(18, 0) type                          11558 / 11696          0.0  2311587203.6       0.9X
> decimal(38, 18) type                         43858 / 44432          0.0  8771537663.4       0.2X
> -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                    11633 / 12070          0.0  2326572295.8       1.0X
> string type                                  31374 / 32178          0.0  6274760187.4       0.4X
> decimal(18, 0) type                          13019 / 13294          0.0  2603841925.4       0.9X
> decimal(38, 18) type                         50719 / 50983          0.0 10143775007.6       0.2X
> {noformat}
> The slowdown is mainly caused by
> [toByteBuffer|https://github.com/apache/parquet-mr/blob/d61d221c9e752ce2cc0da65ede8b55653b3ae21f/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L83].
> If {{toByteBuffer}} is not used when comparing {{Binary}} values, the result is:
> {noformat}
> -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                    11171 / 11508          0.0  2234189382.0       1.0X
> string type                                  30072 / 30290          0.0  6014346455.4       0.4X
> decimal(18, 0) type                          12150 / 12239          0.0  2430052708.8       0.9X
> decimal(38, 18) type                         44974 / 45423          0.0  8994773738.8       0.2X
> {noformat}
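
For readers unfamiliar with the idea, here is a minimal sketch of what comparing binary values without going through a ByteBuffer could look like. It is not the parquet-mr code and not the attached pull request. The class and method names (BinaryCompareSketch, compareArrays, compareViaByteBuffer) are hypothetical, and the unsigned lexicographic ordering is an assumption made for illustration. The first method reads the backing arrays directly. The second wraps each array in a ByteBuffer on every call, which is the kind of per-comparison allocation the issue describes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BinaryCompareSketch {

    // Hypothetical helper: unsigned lexicographic comparison directly on the
    // backing arrays, with no wrapper objects allocated per call.
    static int compareArrays(byte[] a, int aOff, int aLen,
                             byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xFF; // treat bytes as unsigned
            int y = b[bOff + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal: the shorter value sorts first.
        return aLen - bLen;
    }

    // Same ordering, but each call first wraps both arrays in a ByteBuffer.
    // Shown only for contrast with the array-based version above.
    static int compareViaByteBuffer(byte[] a, byte[] b) {
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        int n = Math.min(x.remaining(), y.remaining());
        for (int i = 0; i < n; i++) {
            int cmp = (x.get(i) & 0xFF) - (y.get(i) & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return x.remaining() - y.remaining();
    }

    public static void main(String[] args) {
        byte[] left = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] right = "abd".getBytes(StandardCharsets.UTF_8);
        // Both methods should agree on the sign of the result.
        System.out.println(Integer.signum(
            compareArrays(left, 0, left.length, right, 0, right.length)));
        System.out.println(Integer.signum(compareViaByteBuffer(left, right)));
    }
}
```

Both methods return the same sign for any pair of inputs, so swapping one for the other does not change ordering or min/max results under the assumed ordering. Only the per-call allocation differs.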
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)