kbendick commented on a change in pull request #1525:
URL: https://github.com/apache/iceberg/pull/1525#discussion_r496356425



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java
##########
@@ -63,6 +65,21 @@ public static Schema schemaForTable(SparkSession spark, String name) {
     return new Schema(converted.asNestedType().asStructType().fields());
   }
 
+  /**
+   * Given a Spark table identifier, determine the PartitionSpec.
+   * @param spark the SparkSession which contains the identifier
+   * @param table a TableIdentifier; if the namespace is left blank, catalog().currentDatabase() will be used
+   * @return an IcebergPartitionSpec representing the partitioning of the Spark table
+   * @throws AnalysisException if thrown by the Spark catalog
+   */
+  public static PartitionSpec specForTable(SparkSession spark, TableIdentifier table) throws AnalysisException {
+    String db = table.database().nonEmpty() ? table.database().get() : spark.catalog().currentDatabase();
+    PartitionSpec spec = identitySpec(
+        schemaForTable(spark, table.unquotedString()),
+        spark.catalog().listColumns(db, table.table()).collectAsList());
+    return spec == null ? PartitionSpec.unpartitioned() : spec;

Review comment:
   If we currently only support creating tables with non-hidden partitions via Spark, should we mention that in the javadoc? I'm assuming that we're not supporting hidden partitions due to the [issue described here](https://iceberg.apache.org/spark/#insert-overwrite).
   
   If that is the case, something as simple as the following could suffice:
   ```javadoc
   /**
    * Given a Spark table identifier, determine the PartitionSpec.
    * Note that this currently only supports unpartitioned and identity-based partitioning,
    * due to a bug in Spark 3.0.0: https://iceberg.apache.org/spark/#insert-overwrite
    *
    * Hidden partition specs will be silently ignored.
    * ....
    */
   ```
   
   We might also mention that tables with hidden partitions might be incorrectly written to due to Spark 3.0.0's dynamic partitioning bug.
   
   If that is **_not_** the case, and we're only using non-hidden partitions here because that's all that can be derived from catalyst's `TableIdentifier`, which has no knowledge of hidden partitions (which I'm now thinking is the case), then I do think the JavaDoc could still use an update. How about something like this...
   
   ```javadoc
   /**
    * Given a Spark table identifier, determine the PartitionSpec, which will be either
    * an identity spec or unpartitioned, based on the original table's hive-style
    * partition columns.
    */
   ```
   
   I'd love to hear other suggestions, but to me the JavaDoc still seems to be missing something, though I'm not quite sure what it is. 🤔
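   
   For reference, here's a rough sketch of the only kinds of spec this path can ever produce. It uses Iceberg's public `PartitionSpec.builderFor` API; the schema and the `ts_day` column are made up for illustration, and this is not the actual `identitySpec` implementation:
   
   ```java
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Types;
   
   public class IdentitySpecSketch {
     public static void main(String[] args) {
       // Illustrative schema with one Hive-style partition column ("ts_day").
       Schema schema = new Schema(
           Types.NestedField.required(1, "id", Types.LongType.get()),
           Types.NestedField.optional(2, "ts_day", Types.StringType.get()));
   
       // Identity partitioning on the partition column: the only transform
       // recoverable from catalyst's TableIdentifier / listColumns().
       PartitionSpec identity = PartitionSpec.builderFor(schema)
           .identity("ts_day")
           .build();
   
       // Fallback when the table has no partition columns at all.
       PartitionSpec unpartitioned = PartitionSpec.unpartitioned();
   
       System.out.println(identity);
       System.out.println(unpartitioned);
     }
   }
   ```
   
   Anything fancier (bucketing, truncation, date transforms) simply isn't representable from what `listColumns` returns, which might be worth stating outright.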




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
