Daniel Ford created HUDI-5973:
---------------------------------
Summary: Add cachedSchema per write batch to fix idempotency with
getSourceSchema calls
Key: HUDI-5973
URL: https://issues.apache.org/jira/browse/HUDI-5973
Project: Apache Hudi
Issue Type: Task
Components: deltastreamer
Reporter: Daniel Ford
The issue: getSourceSchema, in the case of the SchemaRegistry provider, is not
idempotent. Even within a single write batch, calling getSourceSchema multiple
times can return different results, because each call may pick up the latest
schema from the schema registry. Ideally it should return one schema for one
write batch.
The fix is to add a new API to the Source abstract class called "clearCaches"
or "cleanupResources", and to add a similar API to SchemaProvider. Within
source.clearCaches, we will call schemaProvider.clearCaches.
In the case of SchemaRegistryProvider, for every batch we will fetch the schema
from the remote schema registry and cache it locally. Subsequent calls to
getSourceSchema will return the same cached value. Before moving on to the next
batch, we will have to call clearCaches, which invalidates the local cache of
the source schema.
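
A rough sketch of the caching idea described above, not the actual Hudi
classes. The names SketchSchemaProvider, CachingRegistrySchemaProvider and
SketchSource are hypothetical stand-ins, the schema is modelled as a plain
String, and the remote registry call is replaced by a Supplier so the example
stays self-contained.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a schema provider base class (not the Hudi class).
abstract class SketchSchemaProvider {
  abstract String getSourceSchema();

  // No-op by default; providers that cache state override this.
  void clearCaches() {
  }
}

// Hypothetical registry-backed provider that caches one schema per write batch.
class CachingRegistrySchemaProvider extends SketchSchemaProvider {
  // Assumption: stands in for the remote schema registry lookup.
  private final Supplier<String> registryFetch;
  // null means "nothing fetched yet for the current batch".
  private String cachedSchema;

  CachingRegistrySchemaProvider(Supplier<String> registryFetch) {
    this.registryFetch = registryFetch;
  }

  @Override
  synchronized String getSourceSchema() {
    if (cachedSchema == null) {
      // First call in this batch: fetch once and remember the result.
      cachedSchema = registryFetch.get();
    }
    return cachedSchema;
  }

  @Override
  synchronized void clearCaches() {
    // Invalidate so the next batch fetches a fresh schema.
    cachedSchema = null;
  }
}

// Hypothetical stand-in for a source that owns a schema provider.
class SketchSource {
  private final SketchSchemaProvider schemaProvider;

  SketchSource(SketchSchemaProvider schemaProvider) {
    this.schemaProvider = schemaProvider;
  }

  // Called before moving on to the next batch; delegates to the provider.
  void clearCaches() {
    schemaProvider.clearCaches();
  }
}
```

With this shape, repeated getSourceSchema calls inside one batch hit the
registry only once, and the caller is responsible for invoking
source.clearCaches between batches.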