szehon-ho commented on code in PR #5376:
URL: https://github.com/apache/iceberg/pull/5376#discussion_r957690407
##########
core/src/main/java/org/apache/iceberg/BaseFilesTable.java:
##########
@@ -140,42 +143,92 @@ protected CloseableIterable<FileScanTask> doPlanFiles() {
}
static class ManifestReadTask extends BaseFileScanTask implements DataTask {
+
+ static final Set<Integer> READABLE_METRICS_FIELD_IDS =
+ TypeUtil.getProjectedIds(DataFile.READABLE_METRICS.type());
+ static final Schema MIN_PROJECTION_FOR_READABLE_METRICS =
+ new Schema(
+ DataFile.COLUMN_SIZES,
+ DataFile.VALUE_COUNTS,
+ DataFile.NULL_VALUE_COUNTS,
+ DataFile.NAN_VALUE_COUNTS,
+ DataFile.LOWER_BOUNDS,
+ DataFile.UPPER_BOUNDS);
+
private final FileIO io;
private final Map<Integer, PartitionSpec> specsById;
private final ManifestFile manifest;
- private final Schema schema;
+ private final Schema dataTableSchema;
+ private final Schema projection;
ManifestReadTask(
Table table,
ManifestFile manifest,
- Schema schema,
+ Schema projection,
String schemaString,
String specString,
ResidualEvaluator residuals) {
super(DataFiles.fromManifest(manifest), null, schemaString, specString,
residuals);
this.io = table.io();
this.specsById = Maps.newHashMap(table.specs());
this.manifest = manifest;
- this.schema = schema;
+ this.dataTableSchema = table.schema();
+ this.projection = projection;
}
@Override
public CloseableIterable<StructLike> rows() {
- return CloseableIterable.transform(manifestEntries(), file ->
(StructLike) file);
+ if (projection.findColumnName(DataFile.READABLE_METRICS.fieldId()) ==
null) {
+ return CloseableIterable.transform(files(projection), file ->
(StructLike) file);
+ } else {
+ Schema fileProjection = TypeUtil.selectNot(projection,
READABLE_METRICS_FIELD_IDS);
+ Schema minProjection =
Review Comment:
I think putting it there will break the scan, right? It's not the
projection the user requested.
Note, this is actually a bit subtle. Because we are doing the join to add
the minimum metrics fields, the file that is read becomes
{any_projected_field_on_file} : {readable_metrics} : {un-projected but
required metrics fields}
So ContentFileWithMetrics works because it discards the "un-projected but
required metrics fields", given that they are outside the range it will
read.
I mean, we could add a select method to GenericDataFile to modify its
internal 'fromProjectionPos' map to conform back to the original projection
(without the "un-projected but required metrics fields"), but it's not
strictly needed.
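To make the layout argument concrete, here is a minimal standalone sketch of
the idea. It is not the Iceberg code: `Row`, `ArrayRow`, and `BoundedRow` are
hypothetical stand-ins (roughly for StructLike, the file read with the joined
projection, and ContentFileWithMetrics), and the sample values are made up.
It only shows why trailing fields that sit past the requested range never
reach the caller.

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectionRangeSketch {

  /** Hypothetical stand-in for a positional struct (not Iceberg's StructLike). */
  interface Row {
    int size();

    Object get(int pos);
  }

  /** Row holding every value read with the joined (wider) projection. */
  static final class ArrayRow implements Row {
    private final Object[] values;

    ArrayRow(Object... values) {
      this.values = values;
    }

    @Override
    public int size() {
      return values.length;
    }

    @Override
    public Object get(int pos) {
      return values[pos];
    }
  }

  /**
   * Hypothetical wrapper: exposes only the first visibleSize positions of the
   * joined row, so anything appended after them is never read by the caller.
   */
  static final class BoundedRow implements Row {
    private final Row joined;
    private final int visibleSize;

    BoundedRow(Row joined, int visibleSize) {
      if (visibleSize < 0 || visibleSize > joined.size()) {
        throw new IllegalArgumentException("Invalid visible size: " + visibleSize);
      }
      this.joined = joined;
      this.visibleSize = visibleSize;
    }

    @Override
    public int size() {
      return visibleSize;
    }

    @Override
    public Object get(int pos) {
      if (pos < 0 || pos >= visibleSize) {
        throw new IndexOutOfBoundsException("Position outside projection: " + pos);
      }
      return joined.get(pos);
    }
  }

  /** Collects the values a caller can see through the Row interface. */
  static List<Object> visibleValues(Row row) {
    List<Object> result = new ArrayList<>();
    for (int pos = 0; pos < row.size(); pos++) {
      result.add(row.get(pos));
    }
    return result;
  }

  public static void main(String[] args) {
    // Layout: {projected file fields} : {readable_metrics} : {extra metrics fields}
    Row joined =
        new ArrayRow("a.parquet", 100L, "readable_metrics", "lower_bounds", "upper_bounds");

    // The user asked for two file fields plus readable_metrics: three positions.
    Row result = new BoundedRow(joined, 3);
    System.out.println(visibleValues(result));
  }
}
```

The alternative mentioned above (a select method that rewrites the position
map) would drop the extra fields from the file itself; the bounded wrapper
gets the same visible result without touching it.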
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]