Hi,

> Would pushing the bucket transforms into Rust be a good exercise to
> get the scaffolding in place?
To be honest, I don't think bucket transforms alone are a good starting
point. A more general approach would be to provide another set of
transform implementations backed by iceberg-rust. The class hierarchy
may look like the following:

```
Transform
  RustTransform
    BucketRustTransform
    YearRustTransform
    ...
```

> I think it would be a win-win situation where we can combine the
> momentum of PyIceberg and push the things that Python doesn't do well
> (mostly multithreading and heavy lifting) to Iceberg-Rust, and also
> have Iceberg-Rust as a library for Rust query engines to interact
> with Iceberg tables directly.

+1.

On Sat, Aug 3, 2024 at 2:20 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey everyone,
>
> In the beginning of PyIceberg, one of the goals was to keep PyIceberg
> pure Python. At some point, we added a Cython Avro decoder for
> performance reasons, but we still have a pure Python fallback. Today
> you can still do metadata operations using s3fs without any native
> code. Once you start reading/writing data, you'll need to pull in
> PyArrow.
>
> I think it would be a win-win situation where we can combine the
> momentum of PyIceberg and push the things that Python doesn't do well
> (mostly multithreading and heavy lifting) to Iceberg-Rust, and also
> have Iceberg-Rust as a library for Rust query engines to interact
> with Iceberg tables directly.
>
> My main question here is about sequencing. Would pushing the bucket
> transforms into Rust be a good exercise to get the scaffolding in
> place?
>
> Kind regards,
> Fokko
>
> On Fri, Aug 2, 2024 at 16:38, Renjie Liu <liurenjie2...@gmail.com> wrote:
>
>> Hi:
>>
>> Thanks Sung for raising this. Just as Ryan said, I'm also +1 for
>> using more iceberg-rust in pyiceberg.
>>
>>> Is it that we would introduce a hard dependency on iceberg-rust?
>>
>> I think so.
>>
>>> Is that a risk that could make PyIceberg unusable for some people?
>>> I don't think that would be a problem since we already have a
>>> requirement for pyarrow for these cases.
>>
>> +1.
>>
>> In fact, what I think about more is making iceberg-rust the backend
>> of pyiceberg; this is what Xuanwo, Fokko, and I have talked about in
>> brainstorming sessions several times. Not only the bucket transform,
>> but also the FileIO in another thread initiated by Xuanwo. I think
>> these are all intermediate steps toward our final goal.
>>
>> On Fri, Aug 2, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>>
>>> In general, I think the idea of using iceberg-rust more from
>>> PyIceberg is great. I think it will be a good path to pushing more
>>> things down to native code.
>>>
>>> What are the trade-offs of doing it this way? Is it that we would
>>> introduce a hard dependency on iceberg-rust? Is that a risk that
>>> could make PyIceberg unusable for some people? I don't think that
>>> would be a problem since we already have a requirement for pyarrow
>>> for these cases.
>>>
>>> On Thu, Aug 1, 2024 at 6:39 AM Sung Yun <sungwy...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> This is something I've been mulling over for a while, and I
>>>> thought this would be the right forum to discuss it as a follow-up
>>>> to a similar discussion thread on using Python bindings from
>>>> iceberg-rust to support PyIceberg.
>>>>
>>>> As soon as we released 0.7.0, which supports writes into tables
>>>> with TimeTransform partitions
>>>> <https://github.com/apache/iceberg-python/pull/784/files>, our
>>>> prospective users started asking about support for Bucket
>>>> Transform partitions.
>>>>
>>>> Iceberg has custom logic for Bucket partitions (thanks for the
>>>> link <https://iceberg.apache.org/spec/#bucket-transform-details>
>>>> Fokko). I took a look at the Java code
>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L99>
>>>> and I think it looks somewhat like:
>>>>
>>>> * mmh3_hash(val) mod (num_buckets)
>>>>
>>>> with field-type-specific logic so that each type is hashed
>>>> appropriately.
>>>>
>>>> Unfortunately, there is no existing pyarrow compute function that
>>>> does this, so I'd like to propose that we write a function in
>>>> iceberg-rust that takes an Arrow Array reference and the number of
>>>> buckets as input, and returns a new Arrow Array with the evaluated
>>>> bucket values corresponding to the input Arrow Array, in the same
>>>> order.
>>>>
>>>> When iceberg-rust becomes more mature, I believe the same
>>>> underlying transform function can be reused for bucket partitions
>>>> within this repository, and in the interim we could support writes
>>>> into Bucket-partitioned tables in PyIceberg by exposing this
>>>> function as a Python binding that we import into PyIceberg.
>>>>
>>>> I'd love to hear how folks feel about this idea!
>>>>
>>>> Cross-posted discussion on iceberg-rust: #514
>>>> <https://github.com/apache/iceberg-rust/discussions/514>
>>>>
>>>> Sung
>>>
>>> --
>>> Ryan Blue
>>> Databricks
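P.S. A minimal Python sketch of the RustTransform hierarchy suggested at
the top of this reply might look like the following. All class names are
illustrative, and the call into the iceberg-rust binding is a
hypothetical placeholder, not an actual PyIceberg or iceberg-rust API:

```python
from abc import ABC, abstractmethod
from typing import Any, List


class Transform(ABC):
    """Base class for partition transforms (stand-in for pyiceberg's Transform)."""

    @abstractmethod
    def transform(self, values: List[Any]) -> List[Any]:
        """Apply the transform to a batch of values."""


class RustTransform(Transform):
    """Common base for transforms whose heavy lifting would be delegated
    to iceberg-rust through a Python binding."""


class BucketRustTransform(RustTransform):
    """Bucket transform; a real implementation would hand the values
    (e.g. an Arrow Array) to iceberg-rust and get bucket ids back."""

    def __init__(self, num_buckets: int) -> None:
        self.num_buckets = num_buckets

    def transform(self, values: List[Any]) -> List[int]:
        # Hypothetical binding call, e.g.:
        #   return iceberg_rust.bucket_transform(values, self.num_buckets)
        raise NotImplementedError("delegated to the iceberg-rust binding")
```

The point of the intermediate RustTransform class is that pure-Python
transforms and Rust-backed transforms can coexist behind the same
Transform interface, so PyIceberg could fall back to pure Python where
the binding is unavailable.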
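P.P.S. The `mmh3_hash(val) mod (num_buckets)` logic Sung describes can
be sketched in pure Python for reference. Per the Iceberg spec, the
hash is 32-bit Murmur3 (x86 variant, seed 0), int/long values are
hashed as their 8-byte little-endian encoding, strings as their UTF-8
bytes, and the bucket id is the positive part of the hash modulo the
number of buckets. This is only an illustrative reference sketch with
made-up function names, not the proposed Rust implementation:

```python
import struct


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python 32-bit Murmur3 (x86 variant), returning a signed
    int32; this is the hash the Iceberg spec uses for bucketing."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF
    h = seed & mask
    length = len(data)
    # Body: process 4-byte little-endian blocks.
    rounded = length - (length % 4)
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i : i + 4], "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotl32(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = ((h << 13) | (h >> 19)) & mask  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & mask
    # Tail: 1-3 remaining bytes.
    k = 0
    tail = data[rounded:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        k = (k * c2) & mask
        h ^= k
    # Finalization mix.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_int(value: int, num_buckets: int) -> int:
    """Spec rule for int/long: hash the 8-byte little-endian encoding,
    keep the positive part, and reduce modulo the number of buckets."""
    h = murmur3_x86_32(struct.pack("<q", value))
    return (h & 0x7FFFFFFF) % num_buckets


def bucket_str(value: str, num_buckets: int) -> int:
    """Spec rule for strings: hash the UTF-8 bytes directly."""
    h = murmur3_x86_32(value.encode("utf-8"))
    return (h & 0x7FFFFFFF) % num_buckets
```

For example, the spec's test vector hashes the int 34 to 2017239379, so
bucket_int(34, 16) gives 2017239379 % 16 = 3. The proposed iceberg-rust
function would apply the same per-type hashing vectorized over an Arrow
Array instead of one Python value at a time.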