rdblue commented on issue #1286:
URL: https://github.com/apache/iceberg/issues/1286#issuecomment-670594342


   I want to clarify here that the amount of data rewritten during a compaction 
should not matter very much. In many cases, the initial commit will fail 
because an operation can take a long time. What matters is how long a retry 
takes because retries are metadata-only operations.
   
   For operations like compaction, Iceberg needs to rewrite existing manifests 
to remove files that were compacted. Any filtered and rewritten manifest is 
cached so that a retry never needs to rewrite the same manifest file 
twice. Iceberg will also use manifest file metadata to avoid even scanning 
manifests that cannot contain the files it is replacing. In most cases with a 
Spark streaming job appending to a table, we would expect that the new 
manifests do not require scanning and that the old manifests are unchanged by 
the appends. Then all of the initial manifest rewrite work can be reused in a 
retry, and the retry should proceed quickly enough to commit within the 10s 
interval. (The minimum amount 
of work to commit is writing a manifest list and a metadata JSON file, which 
should be well under 10s, even with S3.)
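
To make the reuse argument concrete, here is a minimal toy sketch of the idea in Python. It is not Iceberg's implementation: the `Manifest` and `CompactionCommit` classes, the partition-set "summary", and the `.filtered` suffix are all assumptions made up for illustration. It only shows why a retry after a concurrent append should need no new manifest scans or rewrites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    # Hypothetical stand-in for a manifest file, not an Iceberg class.
    path: str
    partitions: frozenset  # summary metadata: partitions this manifest covers
    files: frozenset       # data file paths listed in this manifest


class CompactionCommit:
    """Toy model of a commit that removes compacted files from manifests."""

    def __init__(self, replaced_files, replaced_partitions):
        self.replaced_files = frozenset(replaced_files)
        self.replaced_partitions = frozenset(replaced_partitions)
        self.cache = {}      # manifest path -> filtered result, kept across retries
        self.scanned = 0     # manifests whose entries were actually read
        self.rewritten = 0   # manifests that had to be rewritten

    def _filter(self, manifest):
        # Work done in an earlier attempt is reused as-is.
        if manifest.path in self.cache:
            return self.cache[manifest.path]
        # The summary metadata rules this manifest out without scanning it.
        if manifest.partitions.isdisjoint(self.replaced_partitions):
            return manifest
        self.scanned += 1
        remaining = manifest.files - self.replaced_files
        if remaining == manifest.files:
            result = manifest
        else:
            self.rewritten += 1
            result = Manifest(manifest.path + ".filtered",
                              manifest.partitions, remaining)
        self.cache[manifest.path] = result
        return result

    def apply(self, table_manifests):
        # Called once per commit attempt against the current table state.
        return [self._filter(m) for m in table_manifests]


old = Manifest("m1.avro", frozenset({"day=1"}),
               frozenset({"a.parquet", "b.parquet"}))
commit = CompactionCommit({"a.parquet", "b.parquet"}, {"day=1"})
first = commit.apply([old])

# A streaming append lands before the commit succeeds, so the commit retries.
appended = Manifest("m2.avro", frozenset({"day=2"}), frozenset({"c.parquet"}))
retry = commit.apply([old, appended])

print(commit.scanned, commit.rewritten)  # 1 1
```

In this toy model the retry reuses the cached result for `m1.avro` and skips `m2.avro` based on its summary alone, so the only remaining work would be writing the new manifest list and metadata file.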
   
   I think what we need to do to debug this case is to find out what work the 
retries are doing. In our environment, we can log file system operations, so we 
can see what files are being created in each attempt and how long these are 
taking. Can someone try to reproduce the issue and attach the log from the 
compaction so we can see what is happening that causes the retry to take so 
long?
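
For anyone without file system logging available, a rough way to collect comparable information is to time each file operation per commit attempt. The sketch below is a generic Python helper, not an Iceberg or Spark API: `AttemptLog`, `timed`, and `summary` are hypothetical names, and the wrapped operations and paths would be whatever your environment exposes.

```python
import time
from contextlib import contextmanager


class AttemptLog:
    """Hypothetical helper that records how long each file operation takes."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.records = []  # (attempt, operation, path, seconds)

    @contextmanager
    def timed(self, attempt, operation, path):
        start = self.clock()
        try:
            yield
        finally:
            # Record the duration even if the wrapped operation raises.
            self.records.append(
                (attempt, operation, path, self.clock() - start))

    def summary(self):
        # Per attempt: (number of operations, total seconds spent in them).
        totals = {}
        for attempt, _, _, seconds in self.records:
            count, total = totals.get(attempt, (0, 0.0))
            totals[attempt] = (count + 1, total + seconds)
        return totals
```

Wrapping each file creation in `with log.timed(attempt, "create", path): ...` and comparing `log.summary()` across attempts would show whether a retry is writing more than a manifest list and a metadata file.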




