Limess commented on issue #4031:
URL: https://github.com/apache/hudi/issues/4031#issuecomment-976248107


   > You are seeing a behavior where, when "_hoodie_is_deleted" is set to null 
or false, Hudi persists this column on storage. And you are asking why we 
need to do this and why not just drop the column altogether?
   
   Yes, that's largely the question. We assumed the column would be dropped, 
since deleted records are not persisted and the column is otherwise redundant. 
There also already seem to be code paths that drop redundant columns (e.g. 
`hoodie.datasource.write.drop.partition.columns`).
   
   We were also caught out when we set the column to a string value by mistake. 
The string was written to the end datastore, which then broke our schema in a 
seemingly non-recoverable way: the value was already in the table, and we now 
had a schema type change that wasn't obviously compatible.
   
   I'd suggest:
   * Possibly dropping the column (as you say, if it has little benefit, sure). 
If not, documenting the behaviour somewhere. Alternatively, always including 
the column, along with the other Hudi metadata fields which are already 
prepended to the written schema.
   * If the column is not a boolean, either:
        * Failing hard, as this column is essentially "reserved" for Hudi, or
        * Taking `IS NOT NULL` as truthy
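
   To make the two options for a non-boolean value concrete, here is a rough 
client-side sketch in plain Python. It is not Hudi code and does not reflect 
how Hudi behaves today; the function name `check_delete_flag` and its `strict` 
parameter are made up for illustration, and records are assumed to be plain 
dicts rather than Spark rows.

```python
from typing import Any, Dict, Iterable, List, Mapping

# Column name that Hudi treats as the soft-delete marker.
DELETE_FLAG = "_hoodie_is_deleted"


def check_delete_flag(
    records: Iterable[Mapping[str, Any]], strict: bool = True
) -> List[Dict[str, Any]]:
    """Validate or coerce the delete flag on each record before writing.

    strict=True  -> raise on any non-null, non-boolean value ("fail hard").
    strict=False -> treat any non-null, non-boolean value as True
                    (the "IS NOT NULL as truthy" option).
    Hypothetical helper: nothing like this is claimed to exist in Hudi.
    """
    checked = []
    for index, record in enumerate(records):
        out = dict(record)  # copy so the caller's data is left untouched
        value = out.get(DELETE_FLAG)
        if value is None or isinstance(value, bool):
            pass  # null and real booleans are passed through unchanged
        elif strict:
            raise TypeError(
                f"record {index}: {DELETE_FLAG} must be a boolean or null, "
                f"got {type(value).__name__}"
            )
        else:
            out[DELETE_FLAG] = True
        checked.append(out)
    return checked
```

   Running a guard like this before the write would have caught our string 
value before it reached the table schema, whichever of the two options is 
chosen.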


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
