sydneyhoran opened a new issue, #9143: URL: https://github.com/apache/hudi/issues/9143
**Describe the problem you faced** Hi there, we are facing an issue that is failing to delete records using Deltastreamer on a PostgresDebeziumSource. We are seeing that the `delete` messages from Debezium/Kafka only provide the primary_key id to be deleted, with all other columns set to default values `null`/`""`/`0`. Since Deltastreamer is trying to delete that ID in the `inserted_at` partition of `0`ms (aka `1970-01-01`), it cannot find the record and the commit fails (unless commit-on-errors is used, then the delete will be ignored and the data will have rows it should not). One question is whether [this PR](https://github.com/apache/hudi/pull/8690) accounts for the partition column being passed as 0 or is that a separate issue? It could possibly mean that the upstream Postgres databases are using [Replica Identity](https://debezium.io/documentation/reference/1.6/connectors/postgresql.html#postgresql-replica-identity) `default` on the source tables, but the behavior here is a little different than[ the documentation](https://debezium.io/documentation/reference/1.6/connectors/postgresql.html#postgresql-replica-identity) (it looks like the "before" would only have the PK column, however our "before" has all columns just with their default values). Wondering if anyone else has faced this before, or if there could be a workaround in Deltastreamer that allows it to delete the record without knowing what partition it is from. We will document any findings during our investigation/experimentation with upstream sources. Thanks in advance! Example delete message from Debezium/Kafka: ``` { "topic_name": "<redacted>", "partition": 0, "offset": 136251, "value": { "before": { "id": "4b-<redacted>-09", "col1": "", "col2": "", "col3": 0, "col4": 0, "col5": null, "col6": 0, "inserted_at": 0, }, "after": null, "source": { "version": "1.9.6.Final", "connector": "postgresql", "name": "<redacted>", "ts_ms": 1688672165304, "snapshot": "false", "db": "<redacted>", "sequence": "[\"121985431416\",\"121985482592\"]", "schema": "<redacted>", "table": "<redacted>", "txId": 45263257, "lsn": 121985482592, "xmin": null }, "op": "d", "ts_ms": 1688672165397, "transaction": null } } ``` **To Reproduce** Steps to reproduce the behavior: 1. Run Deltastreamer on a PostgresDebeziumSource connected to a Kafka topic, using Timestamp Key generator 2. Perform a delete on the Postgres DB and emit the record to Kafka through the Debezium replication slot 3. Monitor Deltastreamer logs for this error ``` Error for key:HoodieKey { recordKey=4b-<redacted>-09 partitionPath=1970/01/01} is java.util.NoSuchElementException ``` 4. Deltastreamer commit will fail and rollback, cannot proceed past that message in Kafka 5. Turn on commit-on-errors and the error is ignored but the record is not deleted **Expected behavior** The delete should be **Environment Description** * Hudi version : 13 * Spark version : 3.1 * Hive version : N/A * Hadoop version : N/A * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : yes **Additional context** [Previous issue](https://github.com/apache/hudi/issues/8519) may provide more context (this issue is a continuation after the tombstone/NullPointer issue was solved). **Stacktrace** ``` 23/07/07 17:31:24 ERROR org.apache.hudi.utilities.deltastreamer.DeltaSync: Delta Sync found errors when writing. Errors/Total=2448/4998 23/07/07 17:31:24 ERROR org.apache.hudi.utilities.deltastreamer.DeltaSync: Printing out the top 100 errors 23/07/07 17:31:25 ERROR org.apache.hudi.utilities.deltastreamer.DeltaSync: Global error : 23/07/07 17:31:25 INFO org.apache.hudi.utilities.deltastreamer.DeltaSync: Error for key:HoodieKey { recordKey=<redacted> partitionPath=1970/01/01} is java.util.NoSuchElementException: No value present in Option 23/07/07 17:31:25 INFO org.apache.hudi.utilities.deltastreamer.DeltaSync: Error for key:HoodieKey { recordKey=<redacted> partitionPath=1970/01/01} is java.util.NoSuchElementException: No value present in Option 23/07/07 17:31:25 INFO org.apache.hudi.utilities.deltastreamer.DeltaSync: Error for key:HoodieKey { recordKey=<redacted> partitionPath=1970/01/01} is java.util.NoSuchElementException: No value present in Option 23/07/07 17:31:25 INFO org.apache.hudi.utilities.deltastreamer.DeltaSync: Error for key:HoodieKey { recordKey=<redacted> partitionPath=1970/01/01} is java.util.NoSuchElementException: No value present in Option <...> 23/07/07 17:31:32 INFO org.apache.hudi.utilities.deltastreamer.DeltaSync: Shutting down embedded timeline server 23/07/07 17:31:32 ERROR org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer: error while running MultiTableDeltaStreamer for table: <redacted> org.apache.hudi.exception.HoodieException: Commit 20230707173107555 failed and rolled-back ! at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:860) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
