mstebelev opened a new issue, #7325:
URL: https://github.com/apache/iceberg/issues/7325

   ### Query engine
   
   Spark
   
   ### Question
   
   I noticed that, at the end of the action, RewriteManifests copies each 
manifest sequentially in a single thread, which takes a lot of time.
   The stack trace in the UI looks like this:
   ```
   [email protected]/java.util.zip.Inflater.inflateBytesBytes(Native Method)
   [email protected]/java.util.zip.Inflater.inflate(Inflater.java:385) => holding Monitor(java.util.zip.Inflater$InflaterZStreamRef@1147147444})
   [email protected]/java.util.zip.InflaterOutputStream.write(InflaterOutputStream.java:253)
   app//org.apache.iceberg.shaded.org.apache.avro.file.DeflateCodec.decompress(DeflateCodec.java:83)
   app//org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream$DataBlock.decompressUsing(DataFileStream.java:392)
   app//org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:226)
   app//org.apache.iceberg.avro.AvroIterable$AvroReuseIterator.hasNext(AvroIterable.java:191)
   app//org.apache.iceberg.io.CloseableIterable$7$1.hasNext(CloseableIterable.java:197)
   app//org.apache.iceberg.ManifestFiles.copyManifestInternal(ManifestFiles.java:311)
   app//org.apache.iceberg.ManifestFiles.copyRewriteManifest(ManifestFiles.java:288)
   app//org.apache.iceberg.BaseRewriteManifests.copyManifest(BaseRewriteManifests.java:166)
   app//org.apache.iceberg.BaseRewriteManifests.addManifest(BaseRewriteManifests.java:155)
   app//org.apache.iceberg.spark.actions.RewriteManifestsSparkAction$$Lambda$3660/0x00007f746f6fe4b0.accept(Unknown Source)
   [email protected]/java.util.Arrays$ArrayList.forEach(Arrays.java:4390)
   app//org.apache.iceberg.spark.actions.RewriteManifestsSparkAction.replaceManifests(RewriteManifestsSparkAction.java:342)
   app//org.apache.iceberg.spark.actions.RewriteManifestsSparkAction.doExecute(RewriteManifestsSparkAction.java:193)
   app//org.apache.iceberg.spark.actions.RewriteManifestsSparkAction$$Lambda$2901/0x00007f74e63aec58.get(Unknown Source)
   app//org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:127)
   app//org.apache.iceberg.spark.actions.RewriteManifestsSparkAction.execute(RewriteManifestsSparkAction.java:158)
   app//org.apache.iceberg.spark.procedures.RewriteManifestsProcedure.lambda$call$0(RewriteManifestsProcedure.java:107)
   app//org.apache.iceberg.spark.procedures.RewriteManifestsProcedure$$Lambda$2897/0x00007f74e636c508.apply(Unknown Source)
   app//org.apache.iceberg.spark.procedures.BaseProcedure.execute(BaseProcedure.java:100)
   app//org.apache.iceberg.spark.procedures.BaseProcedure.modifyIcebergTable(BaseProcedure.java:81)
   app//org.apache.iceberg.spark.procedures.RewriteManifestsProcedure.call(RewriteManifestsProcedure.java:92)
   app//org.apache.spark.sql.execution.datasources.v2.CallExec.run(CallExec.scala:34)
   ```
   After looking at the code, I found that I can probably disable this 
copying by setting the property 
'compatibility.snapshot-id-inheritance.enabled'='true', but it is poorly 
documented and I'm not sure whether it is safe to use. After reading the 
discussion in https://github.com/apache/iceberg/pull/675, it looks like it is a 
flag for writing manifests so that old versions of readers are able to read 
them. Is that its only purpose?
   Can somebody provide any insight into the consequences of setting that 
property, or advise how to improve RewriteManifests performance in a different 
way?
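
For reference, here is a minimal sketch of how the property named above could be set from Spark, assuming the table is managed through Spark SQL. It only builds the `ALTER TABLE ... SET TBLPROPERTIES` statement as a string. The class name, the helper method, and the table name `my_catalog.db.events` are hypothetical, and the sketch says nothing about whether enabling the property is safe, which is the open question here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class; not part of Iceberg or Spark.
class SnapshotIdInheritanceSketch {
    // Property name exactly as quoted in the question.
    static final String SNAPSHOT_ID_INHERITANCE =
        "compatibility.snapshot-id-inheritance.enabled";

    // Builds a Spark SQL statement that sets the given table properties.
    // Assumes keys and values contain no single quotes (no escaping is done).
    static String setTablePropertiesSql(String table, Map<String, String> props) {
        String body = props.entrySet().stream()
            .map(e -> "'" + e.getKey() + "'='" + e.getValue() + "'")
            .collect(Collectors.joining(", "));
        return "ALTER TABLE " + table + " SET TBLPROPERTIES (" + body + ")";
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(SNAPSHOT_ID_INHERITANCE, "true");

        // `my_catalog.db.events` is a placeholder table name.
        String sql = setTablePropertiesSql("my_catalog.db.events", props);
        System.out.println(sql);

        // With an active SparkSession named `spark` (assumed), the statement
        // would be submitted with: spark.sql(sql);
    }
}
```

The generated statement would be run before calling the rewrite procedure, so that the property is already in effect when the manifests are replaced.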


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

