[jira] [Updated] (PARQUET-1355) Improvement parquet Binary write performance

Yuming Wang (JIRA) Mon, 23 Jul 2018 19:45:31 -0700


     [ https://issues.apache.org/jira/browse/PARQUET-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yuming Wang updated PARQUET-1355:
---------------------------------
    Description: 
*Benchmark code*:
{code:java}
test("Parquet write benchmark") {
  val count = 100 * 1024 * 1024
  val numIters = 5
  withTempPath { path =>
    val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)

    Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)").foreach { dt =>
      benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
        spark.range(count).selectExpr(s"cast(id as $dt) as id")
          .write.mode("overwrite").parquet(path.getAbsolutePath)
      }
    }
    benchmark.run()
  }
}
{code}
*Result*:
{noformat}
-- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                   10963 / 11344          0.0  2192675973.8       1.0X
string type                                 28423 / 29437          0.0  5684553922.2       0.4X
decimal(18, 0) type                         11558 / 11696          0.0  2311587203.6       0.9X
decimal(38, 18) type                        43858 / 44432          0.0  8771537663.4       0.2X


-- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                   11633 / 12070          0.0  2326572295.8       1.0X
string type                                 31374 / 32178          0.0  6274760187.4       0.4X
decimal(18, 0) type                         13019 / 13294          0.0  2603841925.4       0.9X
decimal(38, 18) type                        50719 / 50983          0.0 10143775007.6       0.2X
{noformat}
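As a rough way to read the two tables above, the relative change in best time can be computed per type. The sketch below is illustrative only: the class and method names are hypothetical (not part of Spark or Parquet), and the inputs are the best times in ms copied from the tables.

```java
// Hypothetical helper for reading the benchmark tables; not Spark or Parquet code.
final class RelativeChangeSketch {
    // Percentage change from baseline to candidate; a positive value means slower.
    static double percentChange(double baselineMs, double candidateMs) {
        return (candidateMs - baselineMs) / baselineMs * 100.0;
    }

    public static void main(String[] args) {
        String[] types = {"long", "string", "decimal(18, 0)", "decimal(38, 18)"};
        // Best times (ms) with Parquet 1.8.3 and Parquet 1.10.0, from the tables.
        double[] parquet183 = {10963, 28423, 11558, 43858};
        double[] parquet1100 = {11633, 31374, 13019, 50719};
        for (int i = 0; i < types.length; i++) {
            System.out.printf("%-16s %+.1f%%%n",
                types[i], percentChange(parquet183[i], parquet1100[i]));
        }
    }
}
```

With these inputs, the {{string}} case is about 10.4% slower and the {{decimal(38, 18)}} case about 15.6% slower on Parquet 1.10.0.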
The main factor affecting performance is 
[toByteBuffer|https://github.com/apache/parquet-mr/blob/d61d221c9e752ce2cc0da65ede8b55653b3ae21f/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L83].
 If {{toByteBuffer}} is not used when comparing binaries, the result is:
{noformat}
-- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                   11171 / 11508          0.0  2234189382.0       1.0X
string type                                 30072 / 30290          0.0  6014346455.4       0.4X
decimal(18, 0) type                         12150 / 12239          0.0  2430052708.8       0.9X
decimal(38, 18) type                        44974 / 45423          0.0  8994773738.8       0.2X
{noformat}
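For illustration, a comparison that reads the backing arrays directly, rather than first wrapping each value in a {{ByteBuffer}}, could look like the sketch below. This is not the Parquet implementation or the actual patch: the class and method names are hypothetical, and the sketch assumes unsigned lexicographic ordering of the bytes.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; not Parquet's Binary or comparator code.
final class BinaryCompareSketch {
    // Unsigned lexicographic comparison directly on array slices (no wrapper objects).
    static int compareArrays(byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
        int min = Math.min(aLen, bLen);
        for (int i = 0; i < min; i++) {
            int x = a[aOff + i] & 0xFF; // mask to treat the byte as unsigned
            int y = b[bOff + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return aLen - bLen; // on a shared prefix, the shorter value sorts first
    }

    // Same ordering, but allocates two ByteBuffer wrappers per comparison.
    static int compareViaByteBuffer(byte[] a, byte[] b) {
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        int min = Math.min(x.remaining(), y.remaining());
        for (int i = 0; i < min; i++) {
            int p = x.get(x.position() + i) & 0xFF;
            int q = y.get(y.position() + i) & 0xFF;
            if (p != q) {
                return p - q;
            }
        }
        return x.remaining() - y.remaining();
    }

    public static void main(String[] args) {
        byte[] lo = {0x01, 0x7F};
        byte[] hi = {0x01, (byte) 0x80};
        System.out.println(Integer.signum(compareArrays(lo, 0, lo.length, hi, 0, hi.length)));
        System.out.println(Integer.signum(compareViaByteBuffer(lo, hi)));
    }
}
```

Both methods order {{0x7F}} before {{0x80}}; the array version simply avoids creating wrapper objects on every comparison, which is where the per-value overhead discussed above would come from.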

  was:
*Benchmark code*:
{code:java}
test("Parquet write benchmark") {
  val count = 100 * 1024 * 1024
  val numIters = 5
  withTempPath { path =>
    val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)

    Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)", "timestamp").foreach { dt =>
      benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
        spark.range(count).selectExpr(s"cast(id as $dt) as id")
          .write.mode("overwrite").parquet(path.getAbsolutePath)
      }
    }
    benchmark.run()
  }
}
{code}

*Result*:

{noformat}
-- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                   10963 / 11344          0.0  2192675973.8       1.0X
string type                                 28423 / 29437          0.0  5684553922.2       0.4X
decimal(18, 0) type                         11558 / 11696          0.0  2311587203.6       0.9X
decimal(38, 18) type                        43858 / 44432          0.0  8771537663.4       0.2X


-- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                   11633 / 12070          0.0  2326572295.8       1.0X
string type                                 31374 / 32178          0.0  6274760187.4       0.4X
decimal(18, 0) type                         13019 / 13294          0.0  2603841925.4       0.9X
decimal(38, 18) type                        50719 / 50983          0.0 10143775007.6       0.2X
{noformat}


The main factor affecting performance is 
[toByteBuffer|https://github.com/apache/parquet-mr/blob/d61d221c9e752ce2cc0da65ede8b55653b3ae21f/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L83].
If {{toByteBuffer}} is not used when comparing binaries, the result is:
{noformat}
-- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                   11171 / 11508          0.0  2234189382.0       1.0X
string type                                 30072 / 30290          0.0  6014346455.4       0.4X
decimal(18, 0) type                         12150 / 12239          0.0  2430052708.8       0.9X
decimal(38, 18) type                        44974 / 45423          0.0  8994773738.8       0.2X
{noformat}


> Improvement parquet Binary write performance
> --------------------------------------------
>
>                 Key: PARQUET-1355
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1355
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)