aokolnychyi commented on a change in pull request #2210:
URL: https://github.com/apache/iceberg/pull/2210#discussion_r580544861
##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -790,4 +804,53 @@ public Identifier identifier() {
public static TableIdentifier identifierToTableIdentifier(Identifier
identifier) {
return TableIdentifier.of(Namespace.of(identifier.namespace()),
identifier.name());
}
+
+ /**
+ * Use Spark to list all partitions in the table.
+ *
+ * @param spark a Spark session
+ * @param rootPath a table identifier
+ * @param format format of the file
+ * @return all table's partitions
+ */
+ public static List<SparkTableUtil.SparkPartition> getPartitions(SparkSession
spark, Path rootPath, String format) {
+ FileStatusCache fileStatusCache = FileStatusCache.getOrCreate(spark);
+ Map<String, String> emptyMap = Collections.emptyMap();
+
+ InMemoryFileIndex fileIndex = new InMemoryFileIndex(
Review comment:
Alright, I think we all agree that `add_files` procedure should be
limited to identity partitioning and, ideally, it should not infer partitions
(i.e. we should not try to build a list of partitions in a given location). The
primary use case for this procedure is partial migration and people can use
SNAPSHOT or MIGRATE if they need to migrate all partitions at the same time.
Instead of inferring a list of partitions ourselves, we better ask the user
which partitions to import.
That means we don't need to use `InMemoryFileIndex` right now. We will come
back to it while adding support for migrating path-based tables. We should also
take other query engines into account.
The open question right now is whether we should support adding multiple
partitions in the same call? That seems beneficial but how do we do this and
what would it mean for the procedure API? In earlier discussions, @rdblue
mentioned supporting something like `part_col1=*/part_col2=x` would be great.
While it would be easier for us to support only a single partition, I'd vote
for a better user experience.
We could represent partition path as a string with wildcards but I'd suggest
using a map.
```
iceberg.system.add_files(
table => 'db.tbl', -- required
path => '/root/table/location', -- required
format => 'parquet', -- required
partition => map('p1', '*', 'p2', 'v2') -- required for partitioned tables
)
```
We may add support for `*` later but using a map instead of a string does
not require us parsing the path to get individual partitions.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]