MaxGekk commented on pull request #31475: URL: https://github.com/apache/spark/pull/31475#issuecomment-775373033
> ... why is this necessary instead of deleting from the table or overwriting everything with no new records?

1. Emulating table truncation via the insertion of no rows requires the delete + insert pair to be atomic, but a concrete implementation might not support that even though it can atomically truncate a table.
2. It closes the door on truncation-specific optimizations. If a catalog implementation knew in advance that we want to truncate the entire table rather than delete all of its rows, it could do so more efficiently. For example, a file-based implementation could move the table folder to a trash folder with a single atomic syscall (see the sketch after this list).
3. From a security and permissions point of view, we could distinguish insert with overwrite (or delete) from truncation. I can imagine a case where some roles/users have only truncation permissions but no insert or delete permissions.
4. It is also possible that a truncation is just a record in a catalog-level log, while inserts/deletes are records in table-level logs. We cannot map smoothly onto such an implementation if we emulate table truncation via inserts/deletes.

In general, I do believe we should not hide our intention from catalog implementations: truncation should be explicit. The table catalog implementation should then decide how to carry it out most efficiently. If it can emulate truncation by overwriting with no rows, fine, that is up to it.
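To illustrate point 2, here is a minimal sketch of the difference between emulated and explicit truncation for a hypothetical file-backed table. `FileBackedTable`, `overwriteWithNoRows`, and `truncateTable` are illustrative names for this comment, not the interface proposed by this PR:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical file-based table: each table is a directory of data files.
class FileBackedTable(tableDir: Path, trashDir: Path) {

  // Emulated truncation: delete every data file, then "insert" zero rows.
  // The delete + insert pair must appear atomic to concurrent readers,
  // which this backend cannot guarantee; readers may observe the gap.
  def overwriteWithNoRows(): Unit = {
    val files = Files.list(tableDir)
    try files.forEach(Files.delete(_))
    finally files.close()
    // ... followed by an (empty) insert.
  }

  // Explicit truncation: one atomic rename moves the whole table directory
  // into the trash, then an empty directory takes its place.
  def truncateTable(): Boolean = {
    val target = trashDir.resolve(tableDir.getFileName.toString)
    Files.move(tableDir, target, StandardCopyOption.ATOMIC_MOVE)
    Files.createDirectory(tableDir)
    true
  }
}
```

The point of the sketch: only when the catalog receives the truncation intent explicitly can it pick the single-syscall path; if Spark rewrites the operation into delete + insert before it reaches the catalog, that option is gone.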
