yyanyy commented on a change in pull request #1820:
URL: https://github.com/apache/iceberg/pull/1820#discussion_r530607513
##########
File path: core/src/main/java/org/apache/iceberg/ManifestReader.java
##########
@@ -52,8 +54,13 @@
public class ManifestReader<F extends ContentFile<F>>
extends CloseableGroup implements CloseableIterable<F> {
static final ImmutableList<String> ALL_COLUMNS = ImmutableList.of("*");
- static final Set<String> STATS_COLUMNS = Sets.newHashSet(
+
+ // the difference between the two stats set below is to support ContentFile.copyWithoutStats(), which
+ // still keeps record count.
+ private static final Set<String> STATS_COLUMNS = Sets.newHashSet(
Review comment:
I think `copyWithoutStats` [doesn't discard record
count](https://github.com/apache/iceberg/blob/b1296bcbe8e050d4bc28e3d41feb2f8868c8f0bf/core/src/main/java/org/apache/iceberg/BaseFile.java#L167)
while discarding all column-specific stats.
I do agree that having one list is simpler. My reasons for keeping two are:
- If we add `record_count` to this list, it will result in a behavior
change: people who select `record_count` without any of the other stats listed
here previously did not receive those stats, but now they would receive the
full list. This is because `dropStats` relies on this list.
- Alternatively, we can stop copying `recordCount` over within
`copyWithoutStats`, but I'm not entirely sure we want to do that, since the
metrics that can currently be discarded are all maps and `recordCount` is a
`long`. I guess if we no longer copy `recordCount` we may as well not copy
`fileSizeInBytes`, which is another `long`. After this change, since these two
attributes return a primitive type, they would return -1, which I'm not sure
is the best thing to do.
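To make the second option concrete, here is a minimal sketch in plain Java. `FileSketch` and both copy methods are hypothetical stand-ins, not Iceberg's `BaseFile`, and the -1 fallback is assumed from the description above rather than taken from the code.

```java
import java.util.Collections;
import java.util.Map;

public class CopyWithoutStatsSketch {
  // Hypothetical stand-in for a content file; not Iceberg's BaseFile.
  static final class FileSketch {
    final long recordCount;
    final long fileSizeInBytes;
    final Map<Integer, Long> valueCounts; // a map-typed metric; null once dropped

    FileSketch(long recordCount, long fileSizeInBytes, Map<Integer, Long> valueCounts) {
      this.recordCount = recordCount;
      this.fileSizeInBytes = fileSizeInBytes;
      this.valueCounts = valueCounts;
    }

    // Behavior described in the comment: map metrics are dropped, long fields are kept.
    FileSketch copyWithoutStats() {
      return new FileSketch(recordCount, fileSizeInBytes, null);
    }

    // The alternative: the long fields are no longer copied and fall back to -1 (assumed).
    FileSketch copyWithoutStatsOrCounts() {
      return new FileSketch(-1L, -1L, null);
    }
  }

  public static void main(String[] args) {
    FileSketch full = new FileSketch(100L, 4096L, Collections.singletonMap(1, 100L));
    FileSketch kept = full.copyWithoutStats();
    FileSketch dropped = full.copyWithoutStatsOrCounts();

    System.out.println(kept.valueCounts == null); // true: a dropped map is visibly absent
    System.out.println(kept.recordCount);         // 100
    System.out.println(dropped.recordCount);      // -1: a sentinel the caller has to know about
  }
}
```

A dropped map can be represented as `null`, but a primitive `long` has no "absent" value, so a caller that sums `recordCount` across files would silently get a wrong total unless it checks for -1.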
I think the first approach is safer, but I wasn't sure if it's worth
changing the behavior to keep the code simpler. Do you have a recommendation?
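
As a rough illustration of the behavior change in the first option, here is a simplified model. The `dropStats` below is a hypothetical stand-in that drops stats only when none of the selected columns appears in the stats list; the real method may apply further conditions, and the column sets are illustrative.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DropStatsSketch {
  // Illustrative stats list that does not contain record_count.
  static final Set<String> STATS_WITHOUT_RECORD_COUNT = new HashSet<>(
      Arrays.asList("value_counts", "null_value_counts", "lower_bounds", "upper_bounds"));

  // The same list with record_count added, i.e. the single-list variant.
  static final Set<String> STATS_WITH_RECORD_COUNT = new HashSet<>(
      Arrays.asList("value_counts", "null_value_counts", "lower_bounds", "upper_bounds",
          "record_count"));

  // Simplified assumption: stats are dropped only if no selected column is a stats column.
  static boolean dropStats(List<String> selected, Set<String> statsColumns) {
    return Collections.disjoint(selected, statsColumns);
  }

  public static void main(String[] args) {
    List<String> selected = Arrays.asList("file_path", "record_count");

    // record_count is not in the list, so the stats are dropped.
    System.out.println(dropStats(selected, STATS_WITHOUT_RECORD_COUNT)); // true

    // record_count is in the list, so the full stats are returned.
    System.out.println(dropStats(selected, STATS_WITH_RECORD_COUNT)); // false
  }
}
```

Under this model, the same projection flips from "stats dropped" to "stats returned" purely because `record_count` joined the list, which is the behavior change in question.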