Hey everyone,

In the beginning of PyIceberg, one of the goals was to keep it pure Python. At some point we added a Cython Avro decoder for performance reasons, but we still have a pure Python fallback. Today you can still do metadata operations using s3fs without any native code. Once you start reading or writing data, you'll need to pull in PyArrow.

I think it would be a win-win situation where we can combine the momentum of PyIceberg and push the things that Python doesn't do well (mostly multithreading and heavy lifting) to Iceberg-Rust, and also have Iceberg-Rust as a library for Rust query engines to interact with Iceberg tables directly.

My main question here is about sequencing. Would pushing the bucket transforms into Rust be a good exercise to get the scaffolding in place?

Kind regards,
Fokko

On Fri, Aug 2, 2024 at 16:38, Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi:
>
> Thanks Sung for raising this. Just as Ryan said, I'm also +1 for using
> more iceberg-rust in pyiceberg.
>
>> Is it that we would introduce a hard dependency on iceberg-rust?
>
> I think so.
>
>> Is that a risk that could make PyIceberg unusable for some people? I don't
>> think that would be a problem since we already have a requirement for
>> pyarrow for these cases.
>
> +1.
>
> In fact, what I think more about is making iceberg-rust the backend of
> pyiceberg, and this is what me, xuanwo and Fokko had talked about in
> brainstorming several times. Not only bucket transform, but also not only
> FileIO in another thread initiated by xuanwo. I think these are all
> intermediate steps to our final goal.
>
> On Fri, Aug 2, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid>
> wrote:
>
>> In general, I think the idea of using iceberg-rust more from PyIceberg is
>> great. I think it will be a good path to pushing more things down to native
>> code.
>>
>> What are the trade-offs of doing it this way? Is it that we would
>> introduce a hard dependency on iceberg-rust? Is that a risk that could make
>> PyIceberg unusable for some people? I don't think that would be a problem
>> since we already have a requirement for pyarrow for these cases.
>>
>> On Thu, Aug 1, 2024 at 6:39 AM Sung Yun <sungwy...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> This is something I've been mulling about for a while and I thought this
>>> would be the right forum to discuss this topic as a follow up to a similar
>>> topic discussion thread on using python bindings from iceberg-rust to
>>> support pyiceberg.
>>>
>>> As soon as we released 0.7.0 which supports writes into tables with
>>> TimeTransform partitions
>>> <https://github.com/apache/iceberg-python/pull/784/files>, our
>>> prospective users started asking about the support for Bucket Transform
>>> partitions.
>>>
>>> Iceberg has a custom logic for Bucket partitions (Thanks for the link
>>> <https://iceberg.apache.org/spec/#bucket-transform-details> Fokko). I
>>> took a look into the Java code
>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L99>
>>> and I think it looks somewhat like:
>>>
>>> * mmh3_hash(val) mod (num_buckets)
>>>
>>> And has field type specific logic so that each type is hashed
>>> appropriately.
>>>
>>> Unfortunately there is no existing pyarrow compute function that does
>>> this, so I'd like to propose that we write the function in iceberg-rust
>>> that takes an Arrow Array reference and the bucket number as the input,
>>> that returns a new Arrow Array reference with the bucket values evaluated
>>> that corresponds to the input Arrow Array in the same order.
>>>
>>> When iceberg-rust becomes more mature, I believe that the same
>>> underlying transform function can be reused for bucket partitions within
>>> this repository, and in the interim we could support writes into Bucket
>>> partitioned tables on PyIceberg by exposing this function as a Python
>>> binding that we import into PyIceberg.
>>>
>>> I'd love to hear how folks feel about this idea!
>>>
>>> Cross posted Discussion on iceberg-rust: #514
>>> <https://github.com/apache/iceberg-rust/discussions/514>
>>>
>>> Sung
>>
>> --
>> Ryan Blue
>> Databricks
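
For readers following the thread, the bucket logic Sung sketches above (hash the value with 32-bit Murmur3, then take it modulo the bucket count) can be illustrated in plain Python. This is a minimal sketch, not the PyIceberg or iceberg-rust implementation. The function names `murmur3_x86_32`, `bucket_long`, and `bucket_str` are hypothetical, and the sketch assumes the rules in the bucket transform section of the Iceberg spec linked in the thread: integers are hashed as 8-byte little-endian values, strings as their UTF-8 bytes, and the sign bit is masked off before the modulo.

```python
import struct

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant), returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    body_end = n & ~3  # length rounded down to a multiple of 4

    # Mix each full 4-byte little-endian block into the state.
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Mix the remaining 1 to 3 tail bytes, if any.
    tail = data[body_end:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization (avalanche) step.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def _bucket(hash_value: int, num_buckets: int) -> int:
    # Assumption from the spec: drop the sign bit, then take the modulo.
    return (hash_value & 0x7FFFFFFF) % num_buckets


def bucket_long(value: int, num_buckets: int) -> int:
    """Bucket an int/long value (hashed as an 8-byte little-endian long)."""
    return _bucket(murmur3_x86_32(struct.pack("<q", value)), num_buckets)


def bucket_str(value: str, num_buckets: int) -> int:
    """Bucket a string value (hashed as its UTF-8 bytes)."""
    return _bucket(murmur3_x86_32(value.encode("utf-8")), num_buckets)


if __name__ == "__main__":
    print(bucket_long(34, 16))
    print(bucket_str("iceberg", 16))
```

The sketch works on one value at a time. The proposal in the thread is for an array-level function in iceberg-rust that takes an Arrow array and a bucket count and returns the bucket values in the same order, which avoids a per-value Python loop like this one.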