[jira] [Commented] (PARQUET-1355) Improvement Binary write performance

ASF GitHub Bot (JIRA) Mon, 23 Jul 2018 20:06:42 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553730#comment-16553730
 ]


ASF GitHub Bot commented on PARQUET-1355:
-----------------------------------------

wangyum opened a new pull request #505: PARQUET-1355: Improvement Binary write 
performance
URL: https://github.com/apache/parquet-mr/pull/505
 
 
   Details can be found here: 
[PARQUET-1355](https://issues.apache.org/jira/browse/PARQUET-1355).
   The average write time for the `decimal(38, 18)` case drops from `50983 ms` to 
`45423 ms`, close to the `44432 ms` measured with Parquet 1.8.3.



> Improvement Binary write performance
> ------------------------------------
>
>                 Key: PARQUET-1355
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1355
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>              Labels: pull-request-available
>
> *Benchmark code*:
> {code:java}
> test("Parquet write benchmark") {
>   val count = 100 * 1024 * 1024
>   val numIters = 5
>   withTempPath { path =>
>     val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)
>     Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)").foreach { dt =>
>       benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
>         spark.range(count).selectExpr(s"cast(id as $dt) as id")
>           .write.mode("overwrite").parquet(path.getAbsolutePath)
>       }
>     }
>     benchmark.run()
>   }
> }
> {code}
> *Result*:
> {noformat}
> -- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                   10963 / 11344          0.0  2192675973.8       1.0X
> string type                                 28423 / 29437          0.0  5684553922.2       0.4X
> decimal(18, 0) type                         11558 / 11696          0.0  2311587203.6       0.9X
> decimal(38, 18) type                        43858 / 44432          0.0  8771537663.4       0.2X
> -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                   11633 / 12070          0.0  2326572295.8       1.0X
> string type                                 31374 / 32178          0.0  6274760187.4       0.4X
> decimal(18, 0) type                         13019 / 13294          0.0  2603841925.4       0.9X
> decimal(38, 18) type                        50719 / 50983          0.0 10143775007.6       0.2X
> {noformat}
> The main cost is the call to 
> [toByteBuffer|https://github.com/apache/parquet-mr/blob/d61d221c9e752ce2cc0da65ede8b55653b3ae21f/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L83].
> If {{toByteBuffer}} is not used when comparing {{Binary}} values, the result is:
> {noformat}
> -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                   11171 / 11508          0.0  2234189382.0       1.0X
> string type                                 30072 / 30290          0.0  6014346455.4       0.4X
> decimal(18, 0) type                         12150 / 12239          0.0  2430052708.8       0.9X
> decimal(38, 18) type                        44974 / 45423          0.0  8994773738.8       0.2X
> {noformat}
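
To illustrate the idea behind the last result, here is a minimal Java sketch contrasting a comparison that wraps each operand in a `ByteBuffer` with one that reads the backing arrays directly. It is not the code from the pull request: the class `BinaryCompareSketch`, both method names, and the choice of unsigned lexicographic ordering are assumptions made purely for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration class; it is not part of parquet-mr.
final class BinaryCompareSketch {

    // Baseline: wrap both arrays in ByteBuffers before comparing.
    // Each call allocates two wrapper objects.
    static int compareViaByteBuffer(byte[] a, byte[] b) {
        ByteBuffer left = ByteBuffer.wrap(a);
        ByteBuffer right = ByteBuffer.wrap(b);
        int n = Math.min(left.remaining(), right.remaining());
        for (int i = 0; i < n; i++) {
            // Mask to 0..255 so bytes compare as unsigned values (assumed ordering).
            int cmp = (left.get(i) & 0xFF) - (right.get(i) & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        // Equal common prefix: the shorter operand sorts first.
        return left.remaining() - right.remaining();
    }

    // Alternative: same ordering, read straight from the arrays, no allocation.
    static int compareArrays(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3};
        byte[] y = {1, 2, (byte) 0x80};
        System.out.println(Integer.signum(compareViaByteBuffer(x, y)));
        System.out.println(Integer.signum(compareArrays(x, y)));
    }
}
```

Both methods are intended to agree on the sign of the result; the direct version only skips the two wrapper allocations per call. The sketch does not show how much of the measured difference above comes from that allocation.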



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
