Re: [DISCUSS] Use iceberg-rust for PyIceberg Bucket Transform

Ryan Blue Thu, 01 Aug 2024 12:22:38 -0700

In general, I think the idea of using iceberg-rust more from PyIceberg is
great. I think it will be a good path to pushing more things down to native
code.


What are the trade-offs of doing it this way? Is it that we would introduce
a hard dependency on iceberg-rust? Is that a risk that could make PyIceberg
unusable for some people? I don't think that would be a problem since we
already have a requirement for pyarrow for these cases.

On Thu, Aug 1, 2024 at 6:39 AM Sung Yun <[email protected]> wrote:

> Hi everyone,
>
> This is something I've been mulling over for a while, and I thought this
> would be the right forum to discuss it as a follow-up to a similar
> discussion thread on using Python bindings from iceberg-rust to
> support PyIceberg.
>
> As soon as we released 0.7.0, which supports writes into tables with
> TimeTransform partitions
> <https://github.com/apache/iceberg-python/pull/784/files>, our
> prospective users started asking about support for Bucket Transform
> partitions.
>
> Iceberg has custom logic for Bucket partitions (thanks for the link
> <https://iceberg.apache.org/spec/#bucket-transform-details>, Fokko). I
> took a look at the Java code
> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L99>
> and I think it looks somewhat like:
>
> * mmh3_hash(val) mod (num_buckets)
>
> It also has field-type-specific logic so that each type is hashed
> appropriately.
>
> Unfortunately, there is no existing pyarrow compute function that does
> this, so I'd like to propose that we write a function in iceberg-rust
> that takes an Arrow Array reference and the number of buckets as input
> and returns a new Arrow Array reference holding the bucket value for
> each element of the input Arrow Array, in the same order.
>
> When iceberg-rust becomes more mature, I believe that the same underlying
> transform function can be reused for bucket partitions within this
> repository, and in the interim we could support writes into Bucket
> partitioned tables on PyIceberg by exposing this function as a Python
> binding that we import into PyIceberg.
>
> I'd love to hear how folks feel about this idea!
>
>
> Cross-posted discussion on iceberg-rust: #514
> <https://github.com/apache/iceberg-rust/discussions/514>
>
>
> Sung
>
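
For reference, the bucket logic described in the quoted message can be
sketched in pure Python. This is only an illustration, not the proposed
iceberg-rust implementation: the function names (murmur3_x86_32,
hash_long, hash_string, bucket_values) are hypothetical, and it covers
only integer and string inputs. It assumes the rules in the Iceberg
spec's bucket transform section: a 32-bit Murmur3 hash with seed 0,
integers hashed as 8-byte little-endian longs, strings hashed as UTF-8
bytes, and the sign bit masked off before taking the modulus.

```python
import struct
from typing import List, Optional, Union

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant), returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    body_len = n & ~3  # length of the part made of whole 4-byte blocks

    # Mix each 4-byte little-endian block into the running hash.
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Mix in the 1 to 3 trailing bytes, if any.
    tail = data[body_len:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization (avalanche) step.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def hash_long(value: int) -> int:
    # Assumption from the spec: ints and longs hash as 8-byte little-endian.
    return murmur3_x86_32(struct.pack("<q", value))


def hash_string(value: str) -> int:
    # Assumption from the spec: strings hash as their UTF-8 bytes.
    return murmur3_x86_32(value.encode("utf-8"))


def bucket_values(
    values: List[Optional[Union[int, str]]], num_buckets: int
) -> List[Optional[int]]:
    """Return one bucket per input value, in input order; None stays None."""
    out: List[Optional[int]] = []
    for v in values:
        if v is None:
            out.append(None)
            continue
        h = hash_string(v) if isinstance(v, str) else hash_long(v)
        # Clear the sign bit first so the modulus is never negative.
        out.append((h & 0x7FFFFFFF) % num_buckets)
    return out
```

A call such as bucket_values([34, None, "iceberg"], 16) returns a list of
the same length and order as its input, which mirrors the Arrow Array in,
Arrow Array out shape proposed above. A native implementation would do
the same work over Arrow buffers without a per-element Python loop.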


-- 
Ryan Blue
Databricks
