wesselr opened a new issue #2853:
URL: https://github.com/apache/iceberg/issues/2853
Setup:
- Spark: 3.0.2
- Iceberg: 0.11.1
Is it possible to write to a table that is partitioned by day and bucketed by an id? I can successfully write to a table that is either partitioned or bucketed, but not both combined. From what I understand, I need to define a UDF for the days transform so that I can sort by it and keep each partition's rows together; otherwise I get `java.lang.IllegalStateException: Already closed files for partition`.
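For reference, here is a sketch of how I create the target table (the schema is hypothetical, inferred from the snippets below; the point is the combined partition spec):

```scala
// Hypothetical schema; only the combined partition spec matters here.
spark.sql("""
  CREATE TABLE hive.test.partition_and_bucket (id BIGINT, ts TIMESTAMP)
  USING iceberg
  PARTITIONED BY (days(ts), bucket(10, id))
""")
```

For the bucket transform: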
```scala
import org.apache.iceberg.spark.IcebergSpark
import org.apache.spark.sql.types.DataTypes

IcebergSpark.registerBucketUDF(spark, "id_bucket10", DataTypes.LongType, 10)
```
or
```scala
import org.apache.iceberg.transforms.Transforms
import org.apache.iceberg.types.Types

val bucketTransform = Transforms.bucket[java.lang.Long](Types.LongType.get(), 10)
def bucketFunc(id: Long): Int = bucketTransform.apply(id)
val id_bucket10 = spark.udf.register("id_bucket10", bucketFunc _)
```
Either of these gives me the desired bucket UDF for Spark, but I am struggling to create a similar UDF for days:
```scala
val daysTransform = Transforms.day[java.sql.Timestamp](Types.TimestampType.withZone())
def daysFunc(ts: java.sql.Timestamp): Int = daysTransform.apply(ts)
val iceberg_days = spark.udf.register("iceberg_days", daysFunc _)
```
This gives me `org.apache.spark.SparkException: Failed to execute user defined function ((timestamp) => int)`, which makes sense, but I am obviously missing whatever is needed to get a `Date` back rather than an `Int`.
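The closest I have gotten is the sketch below. It assumes Iceberg represents timestamps internally as microseconds since the epoch, so the UDF converts the `java.sql.Timestamp` to micros before applying the transform, and that sorting by the transform's `Int` day ordinal clusters rows the same way sorting by the date itself would:

```scala
import java.sql.Timestamp
import org.apache.iceberg.transforms.Transforms
import org.apache.iceberg.types.Types

// Assumption: the day transform takes the timestamp in Iceberg's internal
// representation, i.e. microseconds since the epoch, as a Long.
val daysTransform = Transforms.day[java.lang.Long](Types.TimestampType.withZone())

def daysFunc(ts: Timestamp): Int = {
  // Timestamp.getTime is in milliseconds; convert to microseconds.
  daysTransform.apply(ts.getTime * 1000L)
}

// The result is the day ordinal (days since 1970-01-01), which should sort
// identically to the corresponding date.
val iceberg_days = spark.udf.register("iceberg_days", daysFunc _)
```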
In a nutshell, I am trying to achieve:

```scala
import org.apache.spark.sql.functions.expr

df.sort(expr("iceberg_days(ts)"), expr("id_bucket10(id)"))
  .writeTo("hive.test.partition_and_bucket")
  .append()
```
Thanks,
Wessel