chenjunjiedada commented on pull request #1522:
URL: https://github.com/apache/iceberg/pull/1522#issuecomment-707164336
I ran some of the Spark JMH benchmarks with `NUM_RECORDS = 5000000`. (I didn't
use the default 10000000 because it causes an OOM on my machine with the JMH
setting `jvmArgs=-Xmx4096m`; I'm not sure whether that happened before.) The
results are as follows.
Without reusing the container:
```
Benchmark                                                                          Mode  Cnt  Score   Error  Units
SparkParquetReadersNestedDataBenchmark.readUsingIcebergReader                        ss    5  2.801 ± 0.089   s/op
SparkParquetReadersNestedDataBenchmark.readUsingIcebergReaderUnsafe                  ss    5  3.383 ± 0.090   s/op
SparkParquetReadersNestedDataBenchmark.readUsingSparkReader                          ss    5  4.353 ± 0.162   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReader          ss    5  1.488 ± 0.051   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReaderUnsafe    ss    5  1.886 ± 0.250   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingSparkReader            ss    5  2.078 ± 0.227   s/op
```
When reusing the container:
```
Benchmark                                                                          Mode  Cnt  Score   Error  Units
SparkParquetReadersNestedDataBenchmark.readUsingIcebergReader                        ss    5  2.707 ± 0.053   s/op
SparkParquetReadersNestedDataBenchmark.readUsingIcebergReaderUnsafe                  ss    5  3.149 ± 0.144   s/op
SparkParquetReadersNestedDataBenchmark.readUsingSparkReader                          ss    5  4.344 ± 0.168   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReader          ss    5  1.360 ± 0.155   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingIcebergReaderUnsafe    ss    5  1.863 ± 0.180   s/op
SparkParquetReadersNestedDataBenchmark.readWithProjectionUsingSparkReader            ss    5  2.072 ± 0.147   s/op
```
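These are the Spark Parquet readers. Reusing the container shows a slight
benefit: roughly 1% to 9% lower s/op for the Iceberg readers, although several
of the differences fall within the reported error. The Spark reader results are
essentially unchanged. I will write some JMH benchmarks for the Flink input
format and try it again.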
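For reference, a small sketch of how the relative change between the two runs
can be computed from the scores above. The class and method names are made up
for illustration and are not part of the benchmark code.

```java
final class RelativeChange {
  // Percentage by which `after` is lower than `before` (positive means faster).
  static double percentFaster(double before, double after) {
    return (before - after) / before * 100.0;
  }

  public static void main(String[] args) {
    String[] names = {
      "readUsingIcebergReader",
      "readUsingIcebergReaderUnsafe",
      "readUsingSparkReader",
      "readWithProjectionUsingIcebergReader",
      "readWithProjectionUsingIcebergReaderUnsafe",
      "readWithProjectionUsingSparkReader"
    };
    // Scores in s/op copied from the two tables above.
    double[] withoutReuse = {2.801, 3.383, 4.353, 1.488, 1.886, 2.078};
    double[] withReuse = {2.707, 3.149, 4.344, 1.360, 1.863, 2.072};

    for (int i = 0; i < names.length; i++) {
      System.out.printf("%-45s %5.1f%%%n",
          names[i], percentFaster(withoutReuse[i], withReuse[i]));
    }
  }
}
```

This only compares the mean scores; it does not take the error column into
account.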
I also found two issues:
1. The JMH benchmarks were moved to the spark2 module, but the comments haven't
been updated.
2. The JMH benchmarks throw an exception like the one below:
```
org.apache.iceberg.exceptions.AlreadyExistsException: File already exists: /tmp/parquet-nested-data-benchmark3999980592702894424.parquet
    at org.apache.iceberg.Files$LocalOutputFile.create(Files.java:58)
    at org.apache.iceberg.parquet.ParquetIO$ParquetOutputFile.create(ParquetIO.java:148)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:295)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:283)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:564)
    at org.apache.iceberg.parquet.Parquet$WriteBuilder.build(Parquet.java:265)
    at org.apache.iceberg.spark.data.parquet.SparkParquetReadersNestedDataBenchmark.setupBenchmark(SparkParquetReadersNestedDataBenchmark.java:102)
    at org.apache.iceberg.spark.data.parquet.generated.SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest._jmh_tryInit_f_sparkparquetreadersnesteddatabenchmark0_G(SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest.java:438)
    at org.apache.iceberg.spark.data.parquet.generated.SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest.readUsingSparkReader_SingleShotTime(SparkParquetReadersNestedDataBenchmark_readUsingSparkReader_jmhTest.java:363)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453)
    at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:437)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
We need to delete the created temp file first, before the writer is built.
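A minimal sketch of that idea, assuming the benchmark's data file comes from
`File.createTempFile` (the file name in the trace looks like it, but I have not
confirmed that) and that the writer refuses to overwrite an existing file. The
class and method names here are hypothetical, not the actual fix:

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

final class BenchmarkTempFile {
  // Hypothetical helper: reserves a unique temp path, then removes the empty
  // placeholder so a writer that will not overwrite can create the file itself.
  static File newDataFile(String prefix, String suffix) {
    try {
      File file = File.createTempFile(prefix, suffix);
      Files.delete(file.toPath()); // createTempFile leaves an empty file behind
      file.deleteOnExit();         // clean up whatever the writer creates later
      return file;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    File dataFile = newDataFile("parquet-nested-data-benchmark", ".parquet");
    System.out.println(dataFile + " exists: " + dataFile.exists());
  }
}
```

The path returned here could then be passed to the writer setup in
`setupBenchmark`, which would no longer find a pre-existing file.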
I will fix these issues tomorrow.