wesselr opened a new issue #2853:
URL: https://github.com/apache/iceberg/issues/2853
Setup:
- Spark: 3.0.2
- Iceberg: 0.11.1
Is it possible to write to a table that is partitioned by day and bucketed by an id? I can successfully write to a table that is either partitioned or bucketed, but not both combined. From what I understand, I need to define a UDF for the days transform so that I can sort by it and keep each partition's rows together; otherwise I get `java.lang.IllegalStateException: Already closed files for partition`.
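For reference, here is a sketch of how I create the target table (the schema is hypothetical, inferred from the snippets below; the point is the combined partition spec):

```scala
// Hypothetical schema; only the combined partition spec matters here.
spark.sql("""
  CREATE TABLE hive.test.partition_and_bucket (id BIGINT, ts TIMESTAMP)
  USING iceberg
  PARTITIONED BY (days(ts), bucket(10, id))
""")
```

For the bucket transform: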
```scala
import org.apache.iceberg.spark.IcebergSpark
import org.apache.spark.sql.types.DataTypes

IcebergSpark.registerBucketUDF(spark, "id_bucket10", DataTypes.LongType, 10)
```
or
```scala
import org.apache.iceberg.transforms.Transforms
import org.apache.iceberg.types.Types

val bucketTransform = Transforms.bucket[java.lang.Long](Types.LongType.get(), 10)
def bucketFunc(id: Long): Int = bucketTransform.apply(id)
val id_bucket10 = spark.udf.register("id_bucket10", bucketFunc _)
```
Either of these gives me the desired bucket UDF for Spark, but I am struggling to create a similar UDF for days:
```scala
val daysTransform = Transforms.day[java.sql.Timestamp](Types.TimestampType.withZone())
def daysFunc(ts: java.sql.Timestamp): Int = daysTransform.apply(ts)
val iceberg_days = spark.udf.register("iceberg_days", daysFunc _)
```
This gives me `org.apache.spark.SparkException: Failed to execute user defined function ((timestamp) => int)`, which makes sense, but I am obviously missing whatever is needed to get a `Date` back rather than an `Int`.
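The closest I have gotten is the sketch below. It assumes Iceberg represents timestamps internally as microseconds since the epoch, so the UDF converts the `java.sql.Timestamp` to micros before applying the transform, and that sorting by the transform's `Int` day ordinal clusters rows the same way sorting by the date itself would:

```scala
import java.sql.Timestamp
import org.apache.iceberg.transforms.Transforms
import org.apache.iceberg.types.Types

// Assumption: the day transform takes the timestamp in Iceberg's internal
// representation, i.e. microseconds since the epoch, as a Long.
val daysTransform = Transforms.day[java.lang.Long](Types.TimestampType.withZone())

def daysFunc(ts: Timestamp): Int = {
  // Timestamp.getTime is in milliseconds; convert to microseconds.
  daysTransform.apply(ts.getTime * 1000L)
}

// The result is the day ordinal (days since 1970-01-01), which should sort
// identically to the corresponding date.
val iceberg_days = spark.udf.register("iceberg_days", daysFunc _)
```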
In a nutshell, I am trying to achieve:

```scala
import org.apache.spark.sql.functions.expr

df.sort(expr("iceberg_days(ts)"), expr("id_bucket10(id)"))
  .writeTo("hive.test.partition_and_bucket")
  .append()
```
Thanks,
Wessel