kbendick commented on a change in pull request #1525:
URL: https://github.com/apache/iceberg/pull/1525#discussion_r496356425



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java
##########
@@ -63,6 +65,21 @@ public static Schema schemaForTable(SparkSession spark, String name) {
     return new Schema(converted.asNestedType().asStructType().fields());
   }
 
+  /**
+   * Given a Spark table identifier, determine the PartitionSpec.
+   * @param spark the SparkSession which contains the identifier
+   * @param table a TableIdentifier, if the namespace is left blank the catalog().currentDatabase() will be used
+   * @return a IcebergPartitionSpec representing the partitioning of the Spark table
+   * @throws AnalysisException if thrown by the Spark catalog
+   */
+  public static PartitionSpec specForTable(SparkSession spark, TableIdentifier table) throws AnalysisException {
+    String db = table.database().nonEmpty() ? table.database().get() : spark.catalog().currentDatabase();
+    PartitionSpec spec = identitySpec(
+        schemaForTable(spark, table.unquotedString()),
+        spark.catalog().listColumns(db, table.table()).collectAsList());
+    return spec == null ? PartitionSpec.unpartitioned() : spec;

Review comment:
   EDIT: I goofed. We're creating Iceberg tables from Spark tables, so my first suggestion should be ignored entirely. Of course Spark tables won't have hidden partition specs, as hidden partitioning is an Iceberg concept and we're using this to convert tables from Spark to Iceberg 🤦. Is it possible to re-use a Spark table's bucket-based partitioning, though? I've never personally used it because of the small-files problem it can generate, but are our hash functions for bucketing so different from Spark's (or is there some other issue) that we couldn't make it work, say in a follow-up PR?
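   
   For what it's worth, if we did try that in a follow-up, I'd picture something along these lines. This is just a sketch on my part; the schema, column name, and bucket count are made up for illustration, and since Spark assigns buckets with its own hashing, the resulting layouts won't necessarily agree with Iceberg's bucket transform:
   
   ```java
   // A sketch only: how a follow-up might express a Spark table's bucketing
   // with Iceberg's bucket transform. The schema, column name, and bucket
   // count below are assumptions for illustration, not from this PR.
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Types;
   
   public class BucketSpecSketch {
     public static void main(String[] args) {
       Schema schema = new Schema(
           Types.NestedField.required(1, "user_id", Types.LongType.get()),
           Types.NestedField.optional(2, "event_time", Types.TimestampType.withZone()));
   
       // Iceberg's bucket transform hashes the value and takes it mod the
       // bucket count. Spark computes bucket assignments with its own
       // hashing, so existing files would likely need to be rewritten
       // rather than adopted in place.
       PartitionSpec spec = PartitionSpec.builderFor(schema)
           .bucket("user_id", 16)
           .build();
   
       System.out.println(spec);  // prints the spec's partition fields
     }
   }
   ```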
   
   --- Original ---
   
   My understanding is that we're only searching for non-hidden partitions here because that's all that can be derived from catalyst's `TableIdentifier`, which has no knowledge of hidden partitions and uses Hive-style partitioning instead. That said, I still think the JavaDoc could use an update emphasizing this. How about something like this...
   
   ```javadoc
   /**
    * Given a Spark table identifier, determine the PartitionSpec, which will be either
    * an identity or unpartitioned PartitionSpec based on the original table's hive-style
    * partition columns.
    */
   ```
   
   I'd love to hear other suggestions, but to me the JavaDoc still seems to be missing something; I'm just not quite sure what it is. 🤔
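   
   To make the identity-vs-unpartitioned behavior concrete, here's roughly how I read the `identitySpec` helper used above. This is my own sketch, not the PR's actual implementation; I'm assuming Spark's `org.apache.spark.sql.catalog.Column#isPartition` is what marks the Hive-style partition columns:
   
   ```java
   import java.util.List;
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.spark.sql.catalog.Column;
   
   final class IdentitySpecSketch {
     // My reading of the helper, not the PR's code: build an identity
     // partition field for each Hive-style partition column reported by
     // the Spark catalog, or return null when there are none.
     static PartitionSpec identitySpec(Schema schema, List<Column> columns) {
       PartitionSpec.Builder builder = PartitionSpec.builderFor(schema);
       boolean partitioned = false;
       for (Column column : columns) {
         if (column.isPartition()) {  // flagged by the Spark catalog
           builder.identity(column.name());
           partitioned = true;
         }
       }
       // Matches the null handling in specForTable: no partition columns
       // means the caller falls back to PartitionSpec.unpartitioned().
       return partitioned ? builder.build() : null;
     }
   }
   ```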



