rdblue commented on code in PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#discussion_r2913230037
##########
api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java:
##########
@@ -147,6 +147,32 @@ public interface RewriteDataFiles
*/
String OUTPUT_SPEC_ID = "output-spec-id";
+ /**
+ * Use Parquet row-group level merging during rewrite operations when applicable.
+ *
+ * <p>When enabled, Parquet files will be merged at the row-group level by directly copying row
Review Comment:
If I understand correctly from the comment @lintingbin wrote, it sounds like
this is an attempt to decrease the cost of compaction when it is unstable --
that is, when files that have already been compacted (the 150 MB file) are
compacted a second time. It's a little unclear, but I think the assertion in
the last item (5) is that this is useful if you first rewrite small files to a
larger file and then compact the larger files without rewriting row groups.
This would mean a 2-pass approach: first rewrite the content into medium-sized
files (and whole row groups) and then rewrite into large files with multiple
row groups.
I don't understand the value of that approach. Once you've solved the small
files problem (~100x file count) by rewriting into larger row groups, the
additional benefit of a second compaction is very low (~2x file count). I don't
see why you would perform the second compaction at all if it is just
concatenating the row groups from other files. As long as you're rewriting the
data a second time, it makes much more sense to prepare the data for long-term
storage and query by clustering and ordering the rows. That would significantly
decrease overall size and speed up queries at the same time, which is worth the
cost of the rewrite.
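To put rough numbers on the ~100x versus ~2x estimate, here is a minimal sketch. Only the 150 MB figure comes from the thread; the partition size, the 1.5 MB small-file size, and the 300 MB second-pass target are assumptions chosen for illustration, and `CompactionPasses` and `fileCount` are hypothetical names rather than anything in Iceberg.

```java
public class CompactionPasses {
    // Number of files needed to hold totalBytes at a given file size (ceiling division).
    public static long fileCount(long totalBytes, long fileBytes) {
        return (totalBytes + fileBytes - 1) / fileBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;

        long totalBytes = 150_000L * mb; // assumed amount of data in one partition
        long smallFile = (3 * mb) / 2;   // assumed 1.5 MB files before any compaction
        long mediumFile = 150L * mb;     // the 150 MB file size mentioned in the thread
        long largeFile = 300L * mb;      // assumed target for a second, concatenating pass

        long before = fileCount(totalBytes, smallFile);
        long afterFirstPass = fileCount(totalBytes, mediumFile);
        long afterSecondPass = fileCount(totalBytes, largeFile);

        // First pass: rewrites the data and removes most of the file-count overhead.
        System.out.printf("first pass:  %d -> %d files (%dx fewer)%n",
            before, afterFirstPass, before / afterFirstPass);

        // Second pass: only concatenates row groups, so the gain is small by comparison.
        System.out.printf("second pass: %d -> %d files (%dx fewer)%n",
            afterFirstPass, afterSecondPass, afterFirstPass / afterSecondPass);
    }
}
```

Under these assumed sizes, nearly all of the file-count reduction comes from the first pass, which is why a second pass that only concatenates row groups looks hard to justify.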
And if you are going to cluster and sort the data anyway, I doubt the initial
rewrite makes sense as a separate step. If you are already paying for a rewrite
to avoid tiny row groups, why not reorganize the data in that same pass?
I don't see much value in exposing this -- is it really something that is
worth supporting when it is extremely limited and has a very narrow use case
(if any)?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]