[ 
https://issues.apache.org/jira/browse/HUDI-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5973:
---------------------------------
    Labels: pull-request-available  (was: )

> Add cachedSchema per write batch to fix idempotency with getSourceSchema calls
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-5973
>                 URL: https://issues.apache.org/jira/browse/HUDI-5973
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: deltastreamer
>            Reporter: Daniel Ford
>            Priority: Minor
>              Labels: pull-request-available
>
> getSourceSchema is not idempotent when the SchemaRegistry provider is used. 
> Even within a single write batch, repeated calls to getSourceSchema can each 
> return the latest schema from the schema registry, so the schema may change 
> mid-batch. Ideally it returns one schema per write batch.
> The fix is to add a new API to the Source abstract class, called 
> "clearCaches" or "cleanupResources", and a similar API to SchemaProvider. 
> source.clearCaches will then call schemaProvider.clearCaches.
> In the case of SchemaRegistryProvider, each batch fetches the schema from the 
> remote schema registry once and caches it locally; subsequent calls to 
> getSourceSchema return the cached value. Before moving on to the next batch, 
> clearCaches must be called to invalidate the locally cached source schema.
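
A minimal sketch of the per-batch caching described above, not the actual
Hudi change. CachingSchemaProvider and BatchSource are hypothetical stand-ins
for SchemaRegistryProvider and Source, the schema is a plain String, and the
registry call is simulated by a Supplier; only getSourceSchema and clearCaches
come from the issue text.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a registry-backed schema provider.
class CachingSchemaProvider {
  // Assumption: simulates the remote schema registry lookup.
  private final Supplier<String> registryFetch;
  // null means no schema has been fetched for the current batch yet.
  private String cachedSourceSchema;

  CachingSchemaProvider(Supplier<String> registryFetch) {
    this.registryFetch = registryFetch;
  }

  // The first call in a batch hits the registry; later calls reuse the cache.
  synchronized String getSourceSchema() {
    if (cachedSourceSchema == null) {
      cachedSourceSchema = registryFetch.get();
    }
    return cachedSourceSchema;
  }

  // Invalidates the cached schema so the next batch fetches a fresh one.
  synchronized void clearCaches() {
    cachedSourceSchema = null;
  }
}

// Hypothetical stand-in for the Source abstract class.
class BatchSource {
  private final CachingSchemaProvider schemaProvider;

  BatchSource(CachingSchemaProvider schemaProvider) {
    this.schemaProvider = schemaProvider;
  }

  // Called between batches; delegates to the schema provider as proposed.
  void clearCaches() {
    schemaProvider.clearCaches();
  }
}
```

With this shape, the write loop would call source.clearCaches() once after
each batch, so every getSourceSchema call inside a batch sees the same schema
even if the registry is updated in the meantime.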



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
