openinx commented on pull request #2372:
URL: https://github.com/apache/iceberg/pull/2372#issuecomment-809976337


   As I originally saw it, there are two kinds of compaction:
   
   a. Convert all equality deletes into position deletes. As for whether we 
should also eliminate duplicate position deletes at the same time: if the 
duplicate pos-deletes are removed during the rewrite, subsequent reads will be 
more efficient; if not, they will be less efficient. Generally speaking, I 
think this is a performance trade-off, and either choice seems acceptable to 
me. 
   
   b. Eliminate all deletes (both pos-deletes and equality-deletes). This is 
well suited to tables where deletes make up a high proportion of the data. On 
the one hand, we save a lot of unnecessary storage; on the other hand, we avoid 
a lot of inefficient joins when reading data. 
[This](https://github.com/apache/iceberg/pull/2303/files#diff-605d0d98a73f67629cddbceb9a566e8655844a3cdf46b4dbcebd0e19102e82b4R128)
 is simpler to implement than case.a. 
   
   After reading @rdblue's 
[comment](https://github.com/apache/iceberg/pull/2372#issuecomment-809823407), 
what I find most valuable is that we can use the meta-column abstraction to 
unify the code for case.a, case.b, and the normal read path. Say we have an 
`iterable` of type `Iterable<Row>` with an `_is_deleted` flag inside each row: 
   
   For case.a, we could just use 
`Iterables.transform(Iterables.filter(iterable, row -> row.isDeleted()), row -> 
(row.file(), row.pos()))` to generate all the pos-deletes.
   
   For case.b, we could just use `Iterables.filter(iterable, row -> 
!row.isDeleted())` to get all the remaining rows.
   
   The normal read path is the same as case.b.
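   
   The three cases above can be sketched roughly as follows. This is only an 
illustration, not the actual Iceberg code: `Row`, `PositionDelete`, and the 
`DeleteFlagSketch` class are hypothetical stand-ins, and the sketch uses 
`java.util.stream` in place of Guava's `Iterables` so that it needs no extra 
dependencies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DeleteFlagSketch {

  // Hypothetical row wrapper: the data file it came from, its position in
  // that file, and the _is_deleted flag set after applying all delete files.
  public static final class Row {
    public final String file;
    public final long pos;
    public final boolean isDeleted;

    public Row(String file, long pos, boolean isDeleted) {
      this.file = file;
      this.pos = pos;
      this.isDeleted = isDeleted;
    }
  }

  // Hypothetical (file, pos) pair standing in for a position-delete record.
  public static final class PositionDelete {
    public final String file;
    public final long pos;

    public PositionDelete(String file, long pos) {
      this.file = file;
      this.pos = pos;
    }
  }

  // case.a: keep only the deleted rows and project each to (file, pos).
  public static List<PositionDelete> positionDeletes(List<Row> rows) {
    return rows.stream()
        .filter(row -> row.isDeleted)
        .map(row -> new PositionDelete(row.file, row.pos))
        .collect(Collectors.toList());
  }

  // case.b and the normal read path: keep only the rows that are not deleted.
  public static List<Row> liveRows(List<Row> rows) {
    return rows.stream()
        .filter(row -> !row.isDeleted)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Row> rows = Arrays.asList(
        new Row("data-1.parquet", 0L, false),
        new Row("data-1.parquet", 1L, true),
        new Row("data-2.parquet", 0L, true),
        new Row("data-2.parquet", 1L, false));

    System.out.println("pos-deletes: " + positionDeletes(rows).size());
    System.out.println("live rows:   " + liveRows(rows).size());
  }
}
```

   Both paths consume the same flagged rows and differ only in the predicate, 
which is the unification described above.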
   
   This approach greatly reduces the complexity of the various paths, so I 
think we should try implementing it this way.
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


