chenjunjiedada edited a comment on pull request #1522:
URL: https://github.com/apache/iceberg/pull/1522#issuecomment-707164336
I ran some Spark JMH cases with `NUM_RECORDS = 5000000` (I didn't use the `10000000` from the current code because it causes an OOM on my machine with the JMH `jvmArgs=-Xmx4096m`; I'm not sure how it passed before). The results are shown below.
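The only change to the benchmark code for these runs was roughly the record count (the exact declaration in the benchmark class may differ):

```java
// Dropped from 10_000_000 so the single-shot runs fit in the 4 GB JMH heap.
private static final int NUM_RECORDS = 5_000_000;
```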
When not reusing the container:
```
Benchmark                                                                           Mode  Cnt  Score   Error  Units
SparkParquetReadersNestedDataBenchmark.readUsingIcebergReader                         ss    5  2.801 ± 0.089   s/op
SparkParquetReadersNestedDataBenchmark.readUsingIcebergReaderUnsafe                   ss    5  3.383 ± 0.090   s/op
SparkParquetReadersNestedDataBenchmark.readUsingSparkReader                           ss    5  4.353 ± 0.162   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReader           ss    5  1.488 ± 0.051   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReaderUnsafe     ss    5  1.886 ± 0.250   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingSparkReader             ss    5  2.078 ± 0.227   s/op
```
When reusing the container:
```
Benchmark                                                                           Mode  Cnt  Score   Error  Units
SparkParquetReadersNestedDataBenchmark.readUsingIcebergReader                         ss    5  2.707 ± 0.053   s/op
SparkParquetReadersNestedDataBenchmark.readUsingIcebergReaderUnsafe                   ss    5  3.149 ± 0.144   s/op
SparkParquetReadersNestedDataBenchmark.readUsingSparkReader                           ss    5  4.344 ± 0.168   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReader           ss    5  1.360 ± 0.155   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReaderUnsafe     ss    5  1.863 ± 0.180   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingSparkReader             ss    5  2.072 ± 0.147   s/op
```
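For reference, the difference between the two runs is only whether the reader reuses its row container across records. A minimal sketch of how such a toggle could be wired up (the `Parquet.ReadBuilder` and `SparkParquetReaders` calls are existing Iceberg APIs, but this is my assumption of the benchmark wiring, not this PR's actual diff):

```java
import java.io.File;
import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.parquet.Parquet;
import org.apache.iceberg.spark.data.SparkParquetReaders;
import org.apache.spark.sql.catalyst.InternalRow;

class ReaderVariants {
  // Builds a reader over the benchmark data file; 'reuse' toggles container reuse.
  static CloseableIterable<InternalRow> rows(File dataFile, Schema schema, boolean reuse) {
    Parquet.ReadBuilder builder = Parquet.read(Files.localInput(dataFile))
        .project(schema)
        .createReaderFunc(type -> SparkParquetReaders.buildReader(schema, type));
    if (reuse) {
      builder = builder.reuseContainers(); // reuse the same row container across records
    }
    return builder.build();
  }
}
```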
It shows a slight benefit when reusing the container. @rdblue, does that make sense as a Spark-side change? I will also try to write a JMH benchmark for the Flink input format and run it there.
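For the Flink side, I'm thinking of the same single-shot JMH shape as the Spark benchmarks; below is only a skeleton, with the actual Flink input format read left as placeholder comments (the class and method names here are hypothetical):

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;
import org.openjdk.jmh.annotations.Timeout;
import org.openjdk.jmh.annotations.Warmup;

@Fork(1)
@State(Scope.Benchmark)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@BenchmarkMode(Mode.SingleShotTime)
@Timeout(time = 20, timeUnit = TimeUnit.MINUTES)
public class FlinkInputFormatReaderBenchmark {  // hypothetical name

  @Setup
  public void setupBenchmark() {
    // write a Parquet data file with NUM_RECORDS rows, as the Spark benchmarks do
  }

  @TearDown
  public void tearDownBenchmark() {
    // delete the data file
  }

  @Benchmark
  public void readUsingFlinkInputFormat() {
    // open the Flink input format over the data file and drain all records
  }

  @Benchmark
  public void readUsingFlinkInputFormatWithReuse() {
    // same read path, but with container reuse enabled
  }
}
```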
I also found two issues:
1. The JMH benchmarks were moved to the spark2 module, but the comments haven't been updated.
2. The JMH benchmark cases throw an exception like the one below:
```
org.apache.iceberg.exceptions.AlreadyExistsException: File already exists: /tmp/parquet-nested-data-benchmark3999980592702894424.parquet
    at org.apache.iceberg.Files$LocalOutputFile.create(Files.java:58)
    at org.apache.iceberg.parquet.ParquetIO$ParquetOutputFile.create(ParquetIO.java:148)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:295)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:283)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:564)
    at org.apache.iceberg.parquet.Parquet$WriteBuilder.build(Parquet.java:265)
    at org.apache.iceberg.spark.data.parquet.SparkParquetReadersNestedDataBenchmark.setupBenchmark(SparkParquetReadersNestedDataBenchmark.java:102)
    at org.apache.iceberg.spark.data.parquet.generated.SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest._jmh_tryInit_f_sparkparquetreadersnesteddatabenchmark0_G(SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest.java:438)
    at org.apache.iceberg.spark.data.parquet.generated.SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest.readUsingSparkReader_SingleShotTime(SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest.java:363)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453)
    at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:437)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
We need to delete the created temp file first (see the sketch below). I will fix these issues tomorrow.
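A sketch of the kind of setup fix I have in mind, assuming the data file is created with `File.createTempFile` (which already leaves an empty file on disk that `Files.localOutput(...).create()` then refuses to overwrite):

```java
import java.io.File;
import java.io.IOException;

class BenchmarkFileSetup {
  // Sketch of the @Setup change: create the temp file name, then remove the
  // empty placeholder so Files.localOutput(dataFile).create() does not throw
  // AlreadyExistsException when the Parquet writer is built.
  static File newBenchmarkFile() throws IOException {
    File dataFile = File.createTempFile("parquet-nested-data-benchmark", ".parquet");
    if (!dataFile.delete()) {
      throw new IOException("Could not delete placeholder file: " + dataFile);
    }
    return dataFile;
  }
}
```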