stankiewicz commented on PR #38149: URL: https://github.com/apache/beam/pull/38149#issuecomment-4243110147
> @stankiewicz Thanks for the review - addressed all the feedback from our discussion. Here's what changed:
>
> * Catalog lifecycle: Each DoFn now creates its own catalog via `catalogConfig.newCatalog()` in `@Setup` and closes it in `@Teardown`. This removes the reliance on the shared `cachedCatalog` in `IcebergCatalogConfig`, so the fix works correctly regardless of whether the DoFn is passed as an instance or deserialized. The existing `catalog()` method is preserved for driver-side operations (pipeline construction, schema inference, etc.).
> * Static table cache: `LAST_REFRESHED_TABLE_CACHE` is no longer static - each DoFn owns its own cache instance and passes it to `RecordWriterManager` via the constructor. This prevents a closed catalog's dead Table objects from poisoning other DoFn instances.

Thanks @dejii. What is important to note:

| What | Previous | Change |
| --- | --- | --- |
| table cache (table identifier → storage credentials, IO) | per VM | per harness thread |
| Iceberg catalog instance | per harness thread | per harness thread |
| IO close | finish bundle | catalog teardown closes all IOs that need to be closed |
| leak | no memory leak | slow memory leak: if the DoFn runs for a long time, the catalog will grow over time with tracked FileIO instances holding vended credentials |

@ahmedabu98 we've discussed having multiple catalogs and some quota issues. Can you share some thoughts? This change doesn't reduce the amount of catalogs used over time.
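To make the lifecycle change concrete, here is a minimal, dependency-free Java sketch of the pattern described above. The `Catalog`, `SimpleCatalog`, and `WriteDoFn` names here are hypothetical stand-ins (the real code uses Beam's `DoFn` annotations, `catalogConfig.newCatalog()`, and `RecordWriterManager`); the point is that each instance owns its own catalog and its own table cache, so closing one instance's catalog cannot poison another's.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Iceberg Catalog discussed in the PR.
interface Catalog extends AutoCloseable {
    String loadTable(String identifier);
}

class SimpleCatalog implements Catalog {
    private boolean closed = false;

    public String loadTable(String id) {
        // A closed catalog must not hand out (dead) Table objects.
        if (closed) throw new IllegalStateException("catalog closed");
        return "table:" + id;
    }

    @Override
    public void close() { closed = true; }
}

// Sketch of the per-DoFn lifecycle: a fresh catalog is created in setup()
// (standing in for @Setup) together with a per-instance table cache
// (replacing the former static LAST_REFRESHED_TABLE_CACHE), and the
// catalog is closed in teardown() (standing in for @Teardown).
class WriteDoFn {
    private Catalog catalog;
    private Map<String, String> tableCache;

    void setup() {
        catalog = new SimpleCatalog();   // real code: catalogConfig.newCatalog()
        tableCache = new HashMap<>();    // per-instance, passed to the writer manager
    }

    String getOrLoad(String id) {
        return tableCache.computeIfAbsent(id, catalog::loadTable);
    }

    void teardown() throws Exception {
        catalog.close();                 // teardown closes the catalog and its tracked IOs
    }
}

public class Lifecycle {
    public static void main(String[] args) throws Exception {
        WriteDoFn a = new WriteDoFn();
        WriteDoFn b = new WriteDoFn();
        a.setup();
        b.setup();
        System.out.println(a.getOrLoad("db.t1"));
        a.teardown();
        // b is unaffected by closing a's catalog, since nothing is shared:
        System.out.println(b.getOrLoad("db.t1"));
    }
}
```

With the old static cache, `b` could have received a `Table` backed by `a`'s now-closed catalog; with per-instance state, both lookups succeed independently.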
