Hi,

Thanks Sung for raising this. Just as Ryan said, I'm also +1 for using more iceberg-rust in pyiceberg.
> Is it that we would introduce a hard dependency on iceberg-rust?

I think so.

> Is that a risk that could make PyIceberg unusable for some people? I don't
> think that would be a problem since we already have a requirement for
> pyarrow for these cases.

+1. In fact, what I think about even more is making iceberg-rust the backend of pyiceberg, which is what Xuanwo, Fokko, and I have talked about in brainstorming sessions several times. Not only the bucket transform, but also FileIO, as in the other thread initiated by Xuanwo. I think these are all intermediate steps toward our final goal.

On Fri, Aug 2, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid> wrote:

> In general, I think the idea of using iceberg-rust more from PyIceberg is
> great. I think it will be a good path to pushing more things down to
> native code.
>
> What are the trade-offs of doing it this way? Is it that we would
> introduce a hard dependency on iceberg-rust? Is that a risk that could
> make PyIceberg unusable for some people? I don't think that would be a
> problem since we already have a requirement for pyarrow for these cases.
>
> On Thu, Aug 1, 2024 at 6:39 AM Sung Yun <sungwy...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> This is something I've been mulling over for a while, and I thought this
>> would be the right forum to discuss this topic, as a follow-up to a
>> similar discussion thread on using Python bindings from iceberg-rust to
>> support pyiceberg.
>>
>> As soon as we released 0.7.0, which supports writes into tables with
>> TimeTransform partitions
>> <https://github.com/apache/iceberg-python/pull/784/files>, our
>> prospective users started asking about support for Bucket Transform
>> partitions.
>>
>> Iceberg has custom logic for Bucket partitions (thanks for the link
>> <https://iceberg.apache.org/spec/#bucket-transform-details> Fokko).
>> I took a look into the Java code
>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L99>
>> and I think it looks somewhat like:
>>
>> * mmh3_hash(val) mod (num_buckets)
>>
>> And it has field-type-specific logic so that each type is hashed
>> appropriately.
>>
>> Unfortunately there is no existing pyarrow compute function that does
>> this, so I'd like to propose that we write a function in iceberg-rust
>> that takes an Arrow Array reference and the bucket number as input, and
>> returns a new Arrow Array reference with the evaluated bucket values,
>> corresponding to the input Arrow Array in the same order.
>>
>> When iceberg-rust becomes more mature, I believe the same underlying
>> transform function can be reused for bucket partitions within this
>> repository, and in the interim we could support writes into Bucket
>> partitioned tables in PyIceberg by exposing this function as a Python
>> binding that we import into PyIceberg.
>>
>> I'd love to hear how folks feel about this idea!
>>
>> Cross-posted discussion on iceberg-rust: #514
>> <https://github.com/apache/iceberg-rust/discussions/514>
>>
>> Sung
>
> --
> Ryan Blue
> Databricks
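For anyone who wants to pin down the exact semantics being discussed: the spec page linked above defines the bucket transform as `(murmur3_x86_32_hash(bytes) & Integer.MAX_VALUE) % N`, with a per-type byte serialization (e.g. int and long are hashed as 8-byte little-endian longs). Here is a rough pure-Python sketch of that, using the spec's published test value for int 34 as a sanity check; the function names are my own, not from any Iceberg codebase:

```python
# Rough pure-Python sketch of the Iceberg bucket transform:
#   bucket_N(v) = (murmur3_x86_32(serialize(v)) & Integer.MAX_VALUE) % N
# Function names here are illustrative only.
import struct


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Standard 32-bit Murmur3 hash (the variant the Iceberg spec requires)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data) & ~3  # length rounded down to a multiple of 4
    for i in range(0, n, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[n:]
    k = 0
    if len(tail) == 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # finalization mix
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def bucket_long(value: int, num_buckets: int) -> int:
    # Per the spec, int/long values are serialized as 8-byte little-endian
    # longs; the & 0x7FFFFFFF keeps the hash non-negative before the modulus.
    return (murmur3_x86_32(struct.pack("<q", value)) & 0x7FFFFFFF) % num_buckets


print(murmur3_x86_32(struct.pack("<q", 34)))  # → 2017239379 (spec test value)
print(bucket_long(34, 16))                    # → 3
```

Of course, the point of the proposal is to evaluate this vectorized in Rust over a whole Arrow array rather than per value in Python; the sketch is just to make the per-type hashing semantics concrete.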