ROOBALJINDAL opened a new issue, #12438:
URL: https://github.com/apache/hudi/issues/12438

   **Background:**
   We have created an implementation, MssqlDebeziumSource, similar to the 
MysqlDebeziumSource that we already have. We are using the Apicurio schema 
registry and an AWS MSK cluster for Kafka.
   
   **Issue**: 
   We create a table by ingesting a CSV using CsvDFSSource, which writes a 
commit file with the schema details to the .hoodie folder. 
   When we then run the multistreamer job for the same table to process CDC 
events from Kafka with the upsert operation, and the Kafka topic is empty, Hudi 
performs an empty commit without a schema: the schema details in that commit 
are empty. 
   
   **Use case:** When the Hive server goes down, we create a new one and try to 
sync the external tables from S3 to restore them. The tables are created 
without columns, since the commit schema has no columns in it.
   
   I tested this by changing the source class to JsonKafkaSource and performing 
the same CSV-to-CDC-Kafka transition; it worked fine, and the empty commit was 
written with schema details.
   
   
   **Code details:**
   
![image](https://github.com/user-attachments/assets/0324dee0-1472-4f5e-9556-6afd4e171367)
   
   Here the empty dataset is created without a schema. This looks intentional, 
but could it be made configurable? 
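   To make the difference concrete, here is a simplified stdlib-only model of the two behaviors described above (it is not Hudi's actual API; the class and function names are made up for illustration). The Debezium-style path returns an empty batch with no schema attached, while the JsonKafkaSource-style path carries the last known schema forward even when the topic yields no records:

   ```python
   from dataclasses import dataclass
   from typing import List, Optional

   @dataclass
   class Batch:
       rows: List[dict]
       schema: Optional[List[str]]  # column names; None models "schema missing"

   def fetch_empty_batch_debezium_style() -> Batch:
       # Schema is dropped: a commit written from this batch has no columns,
       # so a later Hive sync recreates the table without columns.
       return Batch(rows=[], schema=None)

   def fetch_empty_batch_json_kafka_style(last_known_schema: List[str]) -> Batch:
       # Schema is carried forward even though the topic was empty,
       # so the empty commit still records the column definitions.
       return Batch(rows=[], schema=last_known_schema)
   ```

   Under this model, the requested behavior amounts to having the Debezium path take the JsonKafkaSource-style branch (or exposing a config to choose it) when the topic is empty.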
   
   **Expected behavior**: The schema should be present even for an empty commit 
from DebeziumSource. If that is not feasible, could this behavior be made 
configurable?
   
   **Environment Description**
   
   * Hudi version : 14.0
   
   * Spark version : 3.4.1
   
   * Storage (HDFS/S3/GCS..) : s3
   
   

