[ https://issues.apache.org/jira/browse/HUDI-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-5973:
---------------------------------
Labels: pull-request-available (was: )
> Add cachedSchema per write batch to fix idempotency with getSourceSchema calls
> ------------------------------------------------------------------------------
>
> Key: HUDI-5973
> URL: https://issues.apache.org/jira/browse/HUDI-5973
> Project: Apache Hudi
> Issue Type: Task
> Components: deltastreamer
> Reporter: Daniel Ford
> Priority: Minor
> Labels: pull-request-available
>
> The issue is that getSourceSchema, in the case of the SchemaRegistry
> provider, is not idempotent. Even within a single write batch, calling
> getSourceSchema multiple times can return different results, because each
> call may fetch the latest schema from the schema registry. Ideally, it
> should return exactly one schema per write batch.
>
> The fix is to add a new API to the Source abstract class, called
> "clearCaches" or "cleanupResources", and a similar API to SchemaProvider.
> Source.clearCaches will then call schemaProvider.clearCaches.
>
> In the case of SchemaRegistryProvider, for every batch we will fetch the
> schema from the remote schema registry and cache it locally. Subsequent
> calls to getSourceSchema will return the same cached value. Before moving
> on to the next batch, we must call clearCaches, which invalidates the
> local cache of the source schema.
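
The caching scheme described in the issue can be sketched roughly as follows.
This is a minimal illustration, not the actual Hudi change: the class and
interface names (SchemaRegistryClient, SketchSchemaProvider,
CachingRegistrySchemaProvider, SketchSource, EvolvingRegistry) are
hypothetical, schemas are modelled as plain strings rather than Avro schemas,
and the method name clearCaches is only one of the two names the issue
proposes. The sketch is also not thread-safe.

```java
// Hypothetical stand-in for a remote schema registry client (assumption).
interface SchemaRegistryClient {
    String fetchLatestSchema();
}

// Hypothetical, simplified schema provider; schemas are plain strings here.
abstract class SketchSchemaProvider {
    abstract String getSourceSchema();

    // No-op by default; providers that cache override this.
    void clearCaches() {}
}

// Fetches the schema at most once per batch and serves later calls from the cache.
class CachingRegistrySchemaProvider extends SketchSchemaProvider {
    private final SchemaRegistryClient client;
    private String cachedSchema; // null means "not fetched for this batch yet"

    CachingRegistrySchemaProvider(SchemaRegistryClient client) {
        this.client = client;
    }

    @Override
    String getSourceSchema() {
        if (cachedSchema == null) {
            cachedSchema = client.fetchLatestSchema();
        }
        return cachedSchema;
    }

    @Override
    void clearCaches() {
        cachedSchema = null; // the next call fetches from the registry again
    }
}

// Hypothetical, simplified source that delegates cache clearing to its provider.
class SketchSource {
    private final SketchSchemaProvider schemaProvider;

    SketchSource(SketchSchemaProvider schemaProvider) {
        this.schemaProvider = schemaProvider;
    }

    String getSourceSchema() {
        return schemaProvider.getSourceSchema();
    }

    void clearCaches() {
        schemaProvider.clearCaches();
    }
}

// Fake registry whose "latest" schema changes on every remote fetch.
class EvolvingRegistry implements SchemaRegistryClient {
    private int fetches = 0;

    @Override
    public String fetchLatestSchema() {
        fetches++;
        return "schema-v" + fetches;
    }

    int fetchCount() {
        return fetches;
    }
}

class CachedSchemaDemo {
    public static void main(String[] args) {
        EvolvingRegistry registry = new EvolvingRegistry();
        SketchSource source = new SketchSource(new CachingRegistrySchemaProvider(registry));

        // Batch 1: repeated calls return the same schema.
        System.out.println(source.getSourceSchema().equals(source.getSourceSchema()));

        // Before batch 2: invalidate, so the next call refetches.
        source.clearCaches();
        System.out.println(source.getSourceSchema());
    }
}
```

In this sketch the write loop would call source.clearCaches() once between
batches; forgetting that call would pin the first schema indefinitely, which
is the trade-off of caching inside the provider.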
--
This message was sent by Atlassian Jira
(v8.20.10#820010)