yihua commented on issue #5385: URL: https://github.com/apache/hudi/issues/5385#issuecomment-1110592113
> what do you mean by schema update? Is the modifying existing hudi table before upsert? Or you talking about just adding _hoodie_is_deleted to the data set used for the upsert?

You need to change the schema by adding the `_hoodie_is_deleted` field before the next upsert. Then, in the upsert batch, every record must carry the `_hoodie_is_deleted` field, set to `true` for the records to be deleted. You can find a concrete example below, derived from the [Deletes docs](https://hudi.apache.org/docs/writing_data/#deletes).

> Also is _hoodie_is_deleted a system field? If so why isnt it always present?

`_hoodie_is_deleted` is used by the write client internally to identify deletes. You only need this field when you want to support inserts, updates, and deletes in the same batch with the UPSERT operation. It is not required in other cases, e.g., the DELETE operation, normal inserts, or upserts without deletes. We're going to relax the requirement of adding `_hoodie_is_deleted` in the future.

> Is there a good example of this?

You can check the [Deletes docs](https://hudi.apache.org/docs/writing_data/#deletes):

- Using DataSource or DeltaStreamer to delete records: add a column named `_hoodie_is_deleted` to the DataSet. The value of this column must be set to `true` for all the records to be deleted and either `false` or left `null` for any records which are to be upserted.
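
To make the "set the flag per record" step concrete, here is a minimal plain-Python sketch of preparing a mixed batch before handing it to the writer. It does not call any Hudi API; the helper name `mark_deletes`, the `key_field` parameter, and the sample records are made up for illustration, and the actual write would still go through your usual DataSource or DeltaStreamer upsert.

```python
DELETE_FLAG = "_hoodie_is_deleted"


def mark_deletes(records, keys_to_delete, key_field="uuid"):
    """Return a new batch in which every record carries the delete flag.

    Records whose key is in keys_to_delete get the flag set to True;
    all other records get False, so they are upserted as usual.
    (Hypothetical helper, not part of Hudi.)
    """
    doomed = set(keys_to_delete)
    return [{**rec, DELETE_FLAG: rec[key_field] in doomed} for rec in records]


# Illustrative batch: one record to upsert, one record to delete.
batch = [
    {"uuid": "a1", "ts": "1", "partitionPath": "americas/brazil/sao_paulo", "rank": 10},
    {"uuid": "b2", "ts": "2", "partitionPath": "americas/brazil/sao_paulo", "rank": 20},
]

flagged = mark_deletes(batch, keys_to_delete=["b2"])
for rec in flagged:
    print(rec["uuid"], rec[DELETE_FLAG])
```

The helper copies each record rather than mutating it, so the input batch stays unchanged if you need to reuse it.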

Let's say the original schema is:

```
{
  "type": "record",
  "name": "example_tbl",
  "fields": [
    { "name": "uuid", "type": "string" },
    { "name": "ts", "type": "string" },
    { "name": "partitionPath", "type": "string" },
    { "name": "rank", "type": "long" }
  ]
}
```

Make sure you add the `_hoodie_is_deleted` column:

```
{
  "type": "record",
  "name": "example_tbl",
  "fields": [
    { "name": "uuid", "type": "string" },
    { "name": "ts", "type": "string" },
    { "name": "partitionPath", "type": "string" },
    { "name": "rank", "type": "long" },
    { "name": "_hoodie_is_deleted", "type": "boolean", "default": false }
  ]
}
```

Then, for any record you want to delete, set `_hoodie_is_deleted` to `true`:

```
{"ts": "0.0", "uuid": "19tdb048-c93e-4532-adf9-f61ce6afe10", "rank": 1045, "partitionPath": "americas/brazil/sao_paulo", "_hoodie_is_deleted": true}
```
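
If you manage the Avro schema as JSON, the schema change above can be scripted. The following is only a sketch under that assumption: it uses Python's standard `json` module, the helper name `with_delete_field` is hypothetical, and it does not touch the Hudi table itself.

```python
import json

# Field definition taken from the example schema above.
DELETE_FIELD = {"name": "_hoodie_is_deleted", "type": "boolean", "default": False}


def with_delete_field(schema):
    """Return the schema with `_hoodie_is_deleted` appended if it is missing.

    The input dict is not modified; if the field already exists, the same
    schema object is returned unchanged. (Hypothetical helper.)
    """
    if any(f["name"] == DELETE_FIELD["name"] for f in schema["fields"]):
        return schema
    return {**schema, "fields": [*schema["fields"], dict(DELETE_FIELD)]}


original = json.loads("""
{
  "type": "record",
  "name": "example_tbl",
  "fields": [
    {"name": "uuid", "type": "string"},
    {"name": "ts", "type": "string"},
    {"name": "partitionPath", "type": "string"},
    {"name": "rank", "type": "long"}
  ]
}
""")

evolved = with_delete_field(original)
print(json.dumps(evolved, indent=2))
```

Appending the field with a default of `false` means records that do not set the flag are treated as normal upserts.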
