aokolnychyi commented on a change in pull request #2210:
URL: https://github.com/apache/iceberg/pull/2210#discussion_r578108148
##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -790,4 +804,53 @@ public Identifier identifier() {
  public static TableIdentifier identifierToTableIdentifier(Identifier identifier) {
    return TableIdentifier.of(Namespace.of(identifier.namespace()), identifier.name());
  }
+
+ /**
+ * Use Spark to list all partitions in the table.
+ *
+ * @param spark a Spark session
+ * @param rootPath the table's root location path
+ * @param format format of the data files
+ * @return all table's partitions
+ */
+ public static List<SparkTableUtil.SparkPartition> getPartitions(SparkSession spark, Path rootPath, String format) {
+ FileStatusCache fileStatusCache = FileStatusCache.getOrCreate(spark);
+ Map<String, String> emptyMap = Collections.emptyMap();
+
+ InMemoryFileIndex fileIndex = new InMemoryFileIndex(
Review comment:
I think adding Iceberg files through this API is off the table. The only
case we are talking about is the import of existing non-Iceberg
tables/partitions. In this PR, Russell uses `InMemoryFileIndex` to get a list
of partitions for a given root table location.
The question I was debating in my previous comment is whether we need to
support importing root table locations. As I said earlier, the main use case
for `add_files` will probably be to add a few partitions after CREATE TABLE
LIKE if the original non-Iceberg table contains a lot of partitions and calling
SNAPSHOT may be expensive. Another use case is gradual migration. In that case,
we don't need to infer what partitions are present. We can have the following
procedure:
```
iceberg.system.add_files(
table => 'db.tbl', -- required
path => '/root/table/location', -- required
format => 'parquet', -- required
partition => map('p1', 'v1', 'p2', 'v2') -- required for partitioned tables
)
```
In this case, we just combine the partition tuple with the root location and
import a single partition at a time.
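To illustrate what "combine the partition tuple with the root location" would mean in practice, here is a minimal, self-contained sketch that builds a Hive-style `key=value` partition directory under the root location. The class and method names are hypothetical, not part of this PR; real code would also need to escape partition values:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPaths {
  /**
   * Appends Hive-style key=value segments to the root location.
   * A LinkedHashMap keeps the partition columns in their declared order.
   */
  public static String partitionPath(String rootLocation, Map<String, String> partition) {
    StringBuilder sb = new StringBuilder(rootLocation);
    for (Map.Entry<String, String> e : partition.entrySet()) {
      sb.append('/').append(e.getKey()).append('=').append(e.getValue());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Map<String, String> partition = new LinkedHashMap<>();
    partition.put("p1", "v1");
    partition.put("p2", "v2");
    // prints /root/table/location/p1=v1/p2=v2
    System.out.println(partitionPath("/root/table/location", partition));
  }
}
```

With an explicit `partition` argument, the procedure only has to list files under this single resolved directory, so no partition discovery is needed.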
The question is whether we should make `partition` optional for partitioned
tables so that you can just point to the root table location. If yes, we will
have to use something like `InMemoryFileIndex` or our own solution to find out
which partitions are present in that location. I would like to avoid that, but
we still need a solution for importing non-HMS tables.
For example, I think this should be valid:
```
MIGRATE `parquet`.`/path/to/table/on/hdfs`
USING iceberg
```
It seems we will need to know what partitions are present in that case.
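As a rough sketch of what that partition inference involves (this is not Spark's `InMemoryFileIndex`, just an illustration with hypothetical names), discovery boils down to recovering a partition spec from the Hive-style `key=value` segments between the root location and each data file:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionSpecs {
  /**
   * Recovers a partition spec from a data file path under the given root,
   * e.g. root /root and file /root/p1=v1/p2=v2/f.parquet yield {p1=v1, p2=v2}.
   * Segments without '=' (such as the file name) are ignored.
   */
  public static Map<String, String> inferPartition(String rootLocation, String filePath) {
    String relative = filePath.substring(rootLocation.length());
    Map<String, String> spec = new LinkedHashMap<>();
    for (String segment : relative.split("/")) {
      int eq = segment.indexOf('=');
      if (eq > 0) {
        spec.put(segment.substring(0, eq), segment.substring(eq + 1));
      }
    }
    return spec;
  }
}
```

A real implementation would additionally have to list files recursively, handle escaped values, and validate the inferred keys against the table's partition spec.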
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]