aokolnychyi commented on a change in pull request #2210:
URL: https://github.com/apache/iceberg/pull/2210#discussion_r578108148
##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -790,4 +804,53 @@ public Identifier identifier() {
  public static TableIdentifier identifierToTableIdentifier(Identifier identifier) {
    return TableIdentifier.of(Namespace.of(identifier.namespace()), identifier.name());
  }
+
+ /**
+ * Use Spark to list all partitions in the table.
+ *
+ * @param spark a Spark session
+ * @param rootPath the table's root location path
+ * @param format format of the data files
+ * @return all table's partitions
+ */
+ public static List<SparkTableUtil.SparkPartition> getPartitions(SparkSession spark, Path rootPath, String format) {
+ FileStatusCache fileStatusCache = FileStatusCache.getOrCreate(spark);
+ Map<String, String> emptyMap = Collections.emptyMap();
+
+ InMemoryFileIndex fileIndex = new InMemoryFileIndex(
Review comment:
I think adding Iceberg files through this API is off the table. The only
case we are talking about is the import of existing non-Iceberg
tables/partitions. In this PR, Russell uses `InMemoryFileIndex` to get a list
of partitions for a given root table location.
The question I was debating in my previous comment is whether we need to
support importing root table locations. As I said earlier, the main use case
for `add_files` will probably be to add a few partitions after CREATE TABLE
LIKE if the original non-Iceberg table contains a lot of partitions and calling
SNAPSHOT may be expensive. Another use case is gradual migration. In that case,
we don't need to infer what partitions are present. We can have the following
procedure:
```
iceberg.system.add_files(
table => 'db.tbl', -- required
path => '/root/table/location', -- required
format => 'parquet', -- required
partition => map('p1', 'v1', 'p2', 'v2') -- required for partitioned tables
)
```
In this case, we just combine the partition tuple with the root location and
import a single partition at a time.
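To illustrate what "combine the partition tuple with the root location" would mean in practice, here is a minimal, self-contained sketch that builds a Hive-style `key=value` partition directory under the root location. The class and method names are hypothetical, not part of this PR; real code would also need to escape partition values:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPaths {
  /**
   * Appends Hive-style key=value segments to the root location.
   * A LinkedHashMap keeps the partition columns in their declared order.
   */
  public static String partitionPath(String rootLocation, Map<String, String> partition) {
    StringBuilder sb = new StringBuilder(rootLocation);
    for (Map.Entry<String, String> e : partition.entrySet()) {
      sb.append('/').append(e.getKey()).append('=').append(e.getValue());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> partition = new LinkedHashMap<>();
    partition.put("p1", "v1");
    partition.put("p2", "v2");
    // prints /root/table/location/p1=v1/p2=v2
    System.out.println(partitionPath("/root/table/location", partition));
  }
}
```

With an explicit `partition` argument, the procedure only has to list files under this single resolved directory, so no partition discovery is needed.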
The question is whether we should make `partition` optional for partitioned
tables so that you can just point to the root table location. If yes, we will
have to use something like `InMemoryFileIndex` or our own solution to find out
which partitions are present in that location. I would like to avoid that, but
we still need a solution for importing non-HMS tables.
For example, I think this should be valid:
```
MIGRATE `parquet`.`/path/to/table/on/hdfs`
USING iceberg
```
It seems we will need to know what partitions are present in that case.
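As a rough sketch of what that partition inference involves (this is not Spark's `InMemoryFileIndex`, just an illustration with hypothetical names), discovery boils down to recovering a partition spec from the Hive-style `key=value` segments between the root location and each data file:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionSpecs {
  /**
   * Recovers a partition spec from a data file path under the given root,
   * e.g. root /root and file /root/p1=v1/p2=v2/f.parquet yield {p1=v1, p2=v2}.
   * Segments without '=' (such as the file name) are ignored.
   */
  public static Map<String, String> inferPartition(String rootLocation, String filePath) {
    String relative = filePath.substring(rootLocation.length());
    Map<String, String> spec = new LinkedHashMap<>();
    for (String segment : relative.split("/")) {
      int eq = segment.indexOf('=');
      if (eq > 0) {
        spec.put(segment.substring(0, eq), segment.substring(eq + 1));
      }
    }
    return spec;
  }
}
```

A real implementation would additionally have to list files recursively, handle escaped values, and validate the inferred keys against the table's partition spec.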
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]