yihua commented on issue #5385:
URL: https://github.com/apache/hudi/issues/5385#issuecomment-1110592113

   > what do you mean by schema update? Is the modifying existing hudi table 
before upsert? Or you talking about just adding _hoodie_is_deleted to the data 
set used for the upsert?
   
   You need to add the `_hoodie_is_deleted` field to the table schema before 
the next upsert.  The upsert batch must then carry the `_hoodie_is_deleted` 
field, set to `true` for the records to be deleted.  You can find a concrete 
example below, derived from the [Deletes 
docs](https://hudi.apache.org/docs/writing_data/#deletes).
   
   > Also is _hoodie_is_deleted a system field? If so why isnt it always 
present?
   
   `_hoodie_is_deleted` is used by the write client internally to identify 
deletes.  You need this field only when you want to mix inserts, updates, and 
deletes in the same batch with the UPSERT operation.  We're going to relax 
this requirement of adding `_hoodie_is_deleted` in the future.  The field is 
not required in other cases, e.g., the DELETE operation, normal inserts, or 
upserts without deletes.
   
   > Is there a good example of this?
   
   You can check [Deletes 
docs](https://hudi.apache.org/docs/writing_data/#deletes):
   
   - Using DataSource or DeltaStreamer to delete records: add a column named 
`_hoodie_is_deleted` to DataSet. The value of this column must be set to `true` 
for all the records to be deleted and either `false` or left `null` for any 
records which are to be upserted.
   
   Let's say the original schema is:
   ```
   {
     "type":"record",
     "name":"example_tbl",
     "fields":[{
        "name": "uuid",
        "type": "string"
     }, {
        "name": "ts",
        "type": "string"
     },  {
        "name": "partitionPath",
        "type": "string"
     }, {
        "name": "rank",
        "type": "long"
     }
   ]}
   ```
   Make sure you add the `_hoodie_is_deleted` column:
   ```
   {
     "type":"record",
     "name":"example_tbl",
     "fields":[{
        "name": "uuid",
        "type": "string"
     }, {
        "name": "ts",
        "type": "string"
     },  {
        "name": "partitionPath",
        "type": "string"
     }, {
        "name": "rank",
        "type": "long"
     }, {
       "name" : "_hoodie_is_deleted",
       "type" : "boolean",
       "default" : false
     }
   ]}
   ```
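   As a rough illustration of the schema change above, the sketch below adds the `_hoodie_is_deleted` field to an Avro record schema held as a plain Python dict.  The helper name `with_delete_marker` is hypothetical and not part of Hudi; it only shows the shape of the evolved schema, assuming the schema is available as JSON.

```python
import copy
import json

# Field definition taken from the evolved schema shown above.
DELETE_FIELD = {"name": "_hoodie_is_deleted", "type": "boolean", "default": False}


def with_delete_marker(schema: dict) -> dict:
    """Return a copy of an Avro record schema that includes `_hoodie_is_deleted`.

    Hypothetical helper for illustration only; not a Hudi API.
    """
    evolved = copy.deepcopy(schema)  # leave the caller's schema untouched
    names = [field["name"] for field in evolved["fields"]]
    if DELETE_FIELD["name"] not in names:
        evolved["fields"].append(dict(DELETE_FIELD))
    return evolved


original = {
    "type": "record",
    "name": "example_tbl",
    "fields": [
        {"name": "uuid", "type": "string"},
        {"name": "ts", "type": "string"},
        {"name": "partitionPath", "type": "string"},
        {"name": "rank", "type": "long"},
    ],
}

evolved = with_delete_marker(original)
print(json.dumps(evolved, indent=2))
```

   The `false` default means records that do not set the field are treated as regular upserts.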
   Then, for any record you want to delete, set `_hoodie_is_deleted` to `true`:
   ```
   {"ts": "0.0", "uuid": "19tdb048-c93e-4532-adf9-f61ce6afe10", "rank": 1045, "partitionPath": "americas/brazil/sao_paulo", "_hoodie_is_deleted" : true}
   ```
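   To mix deletes and upserts in one batch, every record needs the flag, not just the ones being deleted.  The sketch below tags a batch held as plain Python dicts before it is handed to the writer.  `mark_deletes` is a hypothetical helper and the second record is made up for illustration; the actual write with the UPSERT operation is not shown because it depends on your Spark or DeltaStreamer setup.

```python
def mark_deletes(records, uuids_to_delete):
    """Return new records with `_hoodie_is_deleted` set on every row.

    Hypothetical helper for illustration only; not a Hudi API.
    Rows whose `uuid` is in `uuids_to_delete` get True, all others get False.
    """
    doomed = set(uuids_to_delete)
    return [{**record, "_hoodie_is_deleted": record["uuid"] in doomed}
            for record in records]


batch = [
    # Record from the example above: to be deleted.
    {"uuid": "19tdb048-c93e-4532-adf9-f61ce6afe10", "ts": "0.0",
     "partitionPath": "americas/brazil/sao_paulo", "rank": 1045},
    # Made-up record: a regular upsert in the same batch.
    {"uuid": "hypothetical-uuid-to-keep", "ts": "1.0",
     "partitionPath": "americas/brazil/sao_paulo", "rank": 7},
]

marked = mark_deletes(batch, ["19tdb048-c93e-4532-adf9-f61ce6afe10"])
for record in marked:
    print(record["uuid"], record["_hoodie_is_deleted"])
```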


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
