ashvina commented on code in PR #634: URL: https://github.com/apache/incubator-xtable/pull/634#discussion_r1953883251
##########
rfc/rfc-2/2 - Deletion Info Conversion.md:
##########
@@ -0,0 +1,159 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-2: Support for conversion of Deletion Vectors from Delta Lake to Iceberg format
+
+## Proposers**
+
+- @ashvin
+
+## Status
+
+GH Feature Request: https://github.com/apache/incubator-xtable/issues/339
+
+## Abstract
+Deletion vectors are one of the most popular features in table formats, allowing logical deletion of records, by marking
+them as deleted, without requiring rewriting of the data files to remove them physically. This document provides a
+high-level proposal of deletion vector conversion in XTable, focusing on the current implementation of deletion vectors
+in Delta Lake and Iceberg.
+
+## Background
+Deletion vectors feature in table formats like **_Delta Lake_** and **_Apache Iceberg_** are designed to enhance data
+management efficiency. These vectors allow for the marking of deleted rows without physically removing them from the
+data files, thereby optimizing operations such as `DELETE`, `UPDATE`, and `MERGE`. By maintaining a separate files of
+deletions, deletion vectors enable faster query performance and reduce the need for costly data file rewrites. This
+approach aligns with the merge-on-read paradigm, where changes are merged during read operations rather than write
+operations, ensuring minimal disruption to ongoing data ingestion processes.
+
+The main types of delete vectors are _**equality deletes**_ and _**position deletes**_.
+
+#### Equality deletes
+Equality deletes allow for row-level deletions based on specific column values rather than their positions within a
+file. This means that records in data files matching certain criteria are considered deleted without altering the
+original data files. This deletion vector type is useful when identifying and storing the exact position of the row(s)
+is expensive, but the content that identifies the row is clear. This feature is particularly useful in streaming use
+cases where updates are frequent. Currently, Iceberg supports equality deletes, however, there is a proposal to replace
+it in future versions.
+
+#### Position deletes
+Position deletes, on the other hand, specify the ordinals of records within a data file that should be considered
+deleted. This method relies on identifying the exact position within the data file, thereby providing a more precise and
+direct approach to managing row-level deletions. Position deletes can be further categorized into two representations:
+simple table representations and compressed bitmap representations.
+
+**_Simple table representations_** involve a straightforward structure typically stored in Parquet format, containing

Review Comment:
   Instead of loading the deletion file directly, XTable provides the file path, offset, and other inputs to the Delta Lake library. This library then returns a stream of ordinals. So, XTable is agnostic of the file format used and should handle all file formats that the Delta lib supports.
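   To illustrate the flow described in the review comment, here is a minimal sketch in Java. Every name in it (`DeletionFileRef`, `DeletionVectorReader`, `readDeletedOrdinals`, `collectDeletedOrdinals`) is hypothetical and is not taken from XTable or from the Delta Lake library. The interface only stands in for whatever call the Delta library exposes, and the example assumes that call can be modeled as returning a `LongStream` of deleted row ordinals.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

// Hypothetical sketch: none of these types exist in XTable or Delta Lake.
final class DeletionOrdinalsSketch {

    /** Hypothetical descriptor of where a deletion vector is stored. */
    static final class DeletionFileRef {
        final String path;      // location of the deletion file
        final long offset;      // byte offset of the vector inside that file
        final int sizeInBytes;  // length of the serialized vector

        DeletionFileRef(String path, long offset, int sizeInBytes) {
            this.path = path;
            this.offset = offset;
            this.sizeInBytes = sizeInBytes;
        }
    }

    /** Hypothetical stand-in for the library call that decodes the file. */
    interface DeletionVectorReader {
        LongStream readDeletedOrdinals(DeletionFileRef ref);
    }

    /**
     * Collects deleted row ordinals in ascending order. The caller never
     * inspects the file itself, so it does not depend on the on-disk format.
     */
    static List<Long> collectDeletedOrdinals(DeletionFileRef ref, DeletionVectorReader reader) {
        try (LongStream ordinals = reader.readDeletedOrdinals(ref)) {
            return ordinals.sorted().boxed().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        // A fake reader replaces the real library so the sketch runs on its own.
        DeletionVectorReader fakeReader = ref -> LongStream.of(7L, 2L, 5L);
        DeletionFileRef ref = new DeletionFileRef("deletion_vector.bin", 1L, 34);
        System.out.println(collectDeletedOrdinals(ref, fakeReader)); // prints [2, 5, 7]
    }
}
```

   Because the caller only sees ordinals, a different storage format would, under these assumptions, require only a different `DeletionVectorReader` implementation and no change on the caller's side.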
