[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2210: Adds AddFiles Procedure

GitBox Mon, 22 Feb 2021 11:52:30 -0800


aokolnychyi commented on a change in pull request #2210:
URL: https://github.com/apache/iceberg/pull/2210#discussion_r580544861




##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -790,4 +804,53 @@ public Identifier identifier() {
   public static TableIdentifier identifierToTableIdentifier(Identifier 
identifier) {
     return TableIdentifier.of(Namespace.of(identifier.namespace()), 
identifier.name());
   }
+
+  /**
+   * Use Spark to list all partitions in the table.
+   *
+   * @param spark a Spark session
+   * @param rootPath a table identifier
+   * @param format format of the file
+   * @return all table's partitions
+   */
+  public static List<SparkTableUtil.SparkPartition> getPartitions(SparkSession 
spark, Path rootPath, String format) {
+    FileStatusCache fileStatusCache = FileStatusCache.getOrCreate(spark);
+    Map<String, String> emptyMap = Collections.emptyMap();
+
+    InMemoryFileIndex fileIndex = new InMemoryFileIndex(

Review comment:
       Alright, I think we all agree that `add_files` procedure should be 
limited to identity partitioning and, ideally, it should not infer partitions 
(i.e. we should not try to build a list of partitions in a given location). The 
primary use case for this procedure is partial migration and people can use 
SNAPSHOT or MIGRATE if they need to migrate all partitions at the same time. 
Instead of inferring a list of partitions ourselves, we better ask the user 
which partitions to import.
   
   That means we don't need to use `InMemoryFileIndex` right now. We will come 
back to it while adding support for migrating path-based tables. We should also 
take other query engines into account.
   
   The open question right now is whether we should support adding multiple 
partitions in the same call? That seems beneficial but how do we do this and 
what would it mean for the procedure API? In earlier discussions, @rdblue 
mentioned supporting something like `part_col1=*/part_col2=x` would be great. 
While it would be easier for us to support only a single partition, I'd vote 
for a better user experience.
   
   We could represent partition path as a string with wildcards but I'd suggest 
using a map.
   
   ```
   iceberg.system.add_files(
     table => 'db.tbl', -- required
     path => '/root/table/location', -- required
     format => 'parquet', -- required
     partition => map('p1', '*', 'p2', 'v2') -- required for partitioned tables
   )
   ```
   
   We may add support for `*` later but using a map instead of a string does 
not require us parsing the path to get individual partitions.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2210: Adds AddFiles Procedure

Reply via email to