In general, I think the idea of using iceberg-rust more from PyIceberg is great. I think it will be a good path to pushing more things down to native code.
What are the trade-offs of doing it this way? Is it that we would introduce a hard dependency on iceberg-rust? Is that a risk that could make PyIceberg unusable for some people? I don't think that would be a problem since we already have a requirement for pyarrow for these cases.

On Thu, Aug 1, 2024 at 6:39 AM Sung Yun <sungwy...@gmail.com> wrote:

> Hi everyone,
>
> This is something I've been mulling about for a while and I thought this
> would be the right forum to discuss this topic as a follow up to a similar
> topic discussion thread on using python bindings from iceberg-rust to
> support pyiceberg.
>
> As soon as we released 0.7.0 which supports writes into tables with
> TimeTransform partitions
> <https://github.com/apache/iceberg-python/pull/784/files>, our
> prospective users started asking about the support for Bucket Transform
> partitions.
>
> Iceberg has a custom logic for Bucket partitions (Thanks for the link
> <https://iceberg.apache.org/spec/#bucket-transform-details> Fokko). I
> took a look into the Java code
> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L99>
> and I think it looks somewhat like:
>
> * mmh3_hash(val) mod (num_buckets)
>
> And has field type specific logic so that each type is hashed
> appropriately.
>
> Unfortunately there is no existing pyarrow compute function that does
> this, so I'd like to propose that we write the function in iceberg-rust
> that takes an Arrow Array reference and the bucket number as the input,
> that returns a new Arrow Array reference with the bucket values evaluated
> that corresponds to the input Arrow Array in the same order.
>
> When iceberg-rust becomes more mature, I believe that the same underlying
> transform function can be reused for bucket partitions within this
> repository, and in the interim we could support writes into Bucket
> partitioned tables on PyIceberg by exposing this function as a Python
> binding that we import into PyIceberg.
>
> I'd love to hear how folks feel about this idea!
>
> Cross posted Discussion on iceberg-rust: #514
> <https://github.com/apache/iceberg-rust/discussions/514>
>
> Sung
>

--
Ryan Blue
Databricks
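
For reference, the bucket computation described in the quoted message can be sketched in plain Python. This is only an illustration, not the proposed iceberg-rust function and not an existing pyarrow or PyIceberg API: the names `murmur3_x86_32`, `bucket_int`, and `bucket_str` are hypothetical, and the sketch works on single values rather than Arrow arrays. It assumes the rules from the Iceberg spec's bucket transform section: a 32-bit Murmur3 hash with seed 0, integers hashed as 8-byte little-endian longs, strings hashed as UTF-8 bytes, and the sign bit masked off before taking the modulus.

```python
import struct

_MASK32 = 0xFFFFFFFF


def _mix_block(k: int) -> int:
    """Scramble one 32-bit block as in Murmur3 x86 32-bit."""
    k = (k * 0xCC9E2D51) & _MASK32
    k = ((k << 15) | (k >> 17)) & _MASK32  # rotate left by 15
    return (k * 0x1B873593) & _MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Return the unsigned 32-bit Murmur3 (x86 variant) hash of data."""
    h = seed & _MASK32
    n = len(data)
    full = n - (n % 4)

    # Body: consume the input four bytes at a time, little-endian.
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        h ^= _mix_block(k)
        h = ((h << 13) | (h >> 19)) & _MASK32  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & _MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[full:]
    if tail:
        h ^= _mix_block(int.from_bytes(tail, "little"))

    # Finalization: mix in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & _MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & _MASK32
    h ^= h >> 16
    return h


def bucket_int(value: int, num_buckets: int) -> int:
    """Bucket an int/long: hash its 8-byte little-endian form (assumed per spec)."""
    h = murmur3_x86_32(struct.pack("<q", value))
    return (h & 0x7FFFFFFF) % num_buckets  # drop the sign bit, then mod N


def bucket_str(value: str, num_buckets: int) -> int:
    """Bucket a string: hash its UTF-8 bytes (assumed per spec)."""
    h = murmur3_x86_32(value.encode("utf-8"))
    return (h & 0x7FFFFFFF) % num_buckets
```

The type-specific part mentioned in the quoted message shows up here as the different byte encodings chosen before hashing; a vectorized version over Arrow arrays would apply the same per-value logic and return the bucket ids in input order.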