mao-liu opened a new issue, #7030:
URL: https://github.com/apache/paimon/issues/7030

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found nothing similar.
   
   
   ### Paimon version
   
   1.3.1
   
   ### Compute Engine
   
   Flink 1.20
   
   ### Minimal reproduce step
   
   Disclaimer: the performance issues below are observed on very large tables
   
   Table:
   - Several thousand partitions
   - Primary key (PK) table
   - Fixed bucket count (e.g. 32)
   - Partitioned by a string field in the data (not partitioned by time)
   - Table stored in a cloud object store (e.g. S3)
   
   Flink jobs:
   - Streaming, write-only, all partitions are receiving realtime updates
   - Batch, full compaction, all partitions require compaction
   
   For the above setup, we noticed that write performance degrades significantly, and the degradation does **not** correlate with table size or streaming data volume.
   
   After a lot of debugging and logging, we've identified a significant bottleneck for highly partitioned tables:
   - The bottleneck appears to be [FileSystemWriteRestore.restoreFiles](https://github.com/apache/paimon/blob/release-1.3/paimon-core/src/main/java/org/apache/paimon/operation/FileSystemWriteRestore.java#L80), which is invoked via [AbstractFileStoreWrite.createWriterContainer](https://github.com/apache/paimon/blob/release-1.3/paimon-core/src/main/java/org/apache/paimon/operation/AbstractFileStoreWrite.java#L415) for each combination of (partition, bucket)
   - Each execution of `restoreFiles` fetches manifest files
     - This issues API requests to retrieve the files from the cloud filesystem
     - It is also computationally expensive, as the manifest files (Avro/Parquet) must be decompressed and deserialized into the corresponding object models
   - For a highly partitioned table receiving writes across all partitions, this can lead to manifest files being fetched hundreds of thousands of times (e.g. several thousand partitions × 32 buckets already means ~100,000 restore calls)
     - This results in excessively high API costs for object retrieval
     - It also leads to very long batch job run times, or to unstable streaming jobs that cannot keep up with the incoming data
   - Dedicated compaction job performance is particularly impacted by this; a minimal sketch of the access pattern follows this list
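   
   To make the access pattern concrete, here is a minimal sketch of the per-(partition, bucket) restore loop. The `ManifestStore` interface and all names are hypothetical stand-ins for illustration, not the actual Paimon API:
   
   ```java
   import java.util.List;
   
   // Illustrative only: ManifestStore is a hypothetical stand-in, not a Paimon class.
   public class PerBucketRestoreSketch {
   
       interface ManifestStore {
           // Each call implies object-store reads plus Avro/Parquet
           // decompression and deserialization of manifest files.
           List<String> restoreDataFiles(String partition, int bucket);
       }
   
       // One restore per (partition, bucket): with several thousand partitions
       // and 32 buckets this is already on the order of 100,000 calls, each
       // re-reading largely the same manifest files.
       static long restoreAllWriters(ManifestStore store, List<String> partitions, int numBuckets) {
           long restoreCalls = 0;
           for (String partition : partitions) {
               for (int bucket = 0; bucket < numBuckets; bucket++) {
                   store.restoreDataFiles(partition, bucket);
                   restoreCalls++;
               }
           }
           return restoreCalls;
       }
   }
   ```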
   
   ### What doesn't meet your expectations?
   
   - Reduce cloud API requests so that the same manifest files are not fetched redundantly
   - Improve writer initialization performance
   
   ### Anything else?
   
   We have been testing a patch that pre-fetches the entire manifest and caches it in memory, to avoid duplicate API requests for the same object.
   
   Caching also avoids repeatedly decompressing and deserializing the same manifest files.
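   
   As a rough illustration of the caching idea (the `ManifestIo` interface and all names below are hypothetical stand-ins, not the actual patch or Paimon API), decoded manifest entries are kept in memory so that each manifest file is read from the object store at most once:
   
   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   
   // Illustrative only: ManifestIo and the entry type are hypothetical stand-ins.
   public class ManifestCacheSketch {
   
       interface ManifestIo {
           // Expensive path: object-store GET + decompression + deserialization.
           List<String> readManifest(String manifestPath);
       }
   
       private final ManifestIo io;
       private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
   
       public ManifestCacheSketch(ManifestIo io) {
           this.io = io;
       }
   
       // Every (partition, bucket) restore is answered from the cache;
       // computeIfAbsent guarantees at most one fetch per manifest file.
       public List<String> entries(String manifestPath) {
           return cache.computeIfAbsent(manifestPath, io::readManifest);
       }
   }
   ```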
   
   For a batch compaction job:
   - Release 1.3: ~3h run time, ~300k API requests
   - With patch: ~5min run time, ~10k API requests => roughly a 30x improvement in both metrics
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!

