wypoon commented on code in PR #7744:
URL: https://github.com/apache/iceberg/pull/7744#discussion_r1237616093
##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -83,27 +83,47 @@ public static List<DataFile> listPartition(
return listPartition(partition, uri, format, spec, conf, metricsConfig, mapping, 1);
}
+ /**
+ * Returns the data files in a partition by listing the partition location using some number of
Review Comment:
You're right.
##########
data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:
##########
@@ -83,27 +83,47 @@ public static List<DataFile> listPartition(
return listPartition(partition, uri, format, spec, conf, metricsConfig, mapping, 1);
}
+ /**
+ * Returns the data files in a partition by listing the partition location using some number of
+ * threads.
+ *
+ * <p>For Parquet and ORC partitions, this will read metrics from the file footer. For Avro
+ * partitions, metrics are set to null.
+ *
+ * <p>Note: certain metrics, like NaN counts, that are only supported by Iceberg file writers but
+ * not file footers, will not be populated.
+ *
+ * @param partition map of partition columns to column values
+ * @param uri partition location URI
+ * @param format partition format, avro, parquet or orc
+ * @param spec a partition spec
+ * @param conf a Hadoop conf
+ * @param metricsConfig a metrics conf
+ * @param mapping a name mapping
+ * @param parallelism number of threads to use
Review Comment:
Actually no. I have improved the javadoc in the latest revision to explain the parallelism.
`listPartition` works on only one partition (as the name implies). The parallelism is in reading the data files, after the list of files is obtained, to get their metrics. The listing itself is single-threaded.
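
To illustrate the distinction, here is a minimal sketch of the pattern described above: the file list is assumed to have been produced already by a single-threaded listing, and only the per-file metrics read is spread across `parallelism` threads. This is not the code in `TableMigrationUtil`; the class `ParallelMetricsSketch`, the method `readAll`, and the `footerReader` function are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch, not Iceberg's implementation.
final class ParallelMetricsSketch {

  // 'files' stands for the result of a single-threaded listing of one partition.
  // 'footerReader' stands for whatever reads metrics from one file (assumed here).
  static <M> List<M> readAll(
      List<String> files, Function<String, M> footerReader, int parallelism) {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, parallelism));
    try {
      // Submit one metrics-reading task per already-listed file.
      List<Future<M>> futures = new ArrayList<>();
      for (String file : files) {
        futures.add(pool.submit(() -> footerReader.apply(file)));
      }
      // Collect results in the same order as the listing.
      List<M> results = new ArrayList<>();
      for (Future<M> future : futures) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("Interrupted while reading metrics", e);
    } catch (ExecutionException e) {
      throw new RuntimeException("Failed to read metrics", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

With `parallelism` set to 1 this degenerates to reading the files one at a time, which matches the existing overload that passes `1`.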
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]