Hi,

> Would pushing the bucket transforms into Rust be a good exercise to
> get the scaffolding in place?
To be honest, I don't think bucket transforms alone are a good starting
point. A more general approach would be to provide another set of
transform implementations backed by iceberg-rust. The class hierarchy
may look like the following:

```
Transform
  RustTransform
    BucketRustTransform
    YearRustTransform
    ...
```

> I think it would be a win-win situation where we can combine the
> momentum of PyIceberg and push the things that Python doesn't do well
> (mostly multithreading and heavy lifting) to Iceberg-Rust, and also
> have Iceberg-Rust as a library for Rust query engines to interact
> with Iceberg tables directly.

+1.

On Sat, Aug 3, 2024 at 2:20 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey everyone,
>
> In the beginning of PyIceberg, one of the goals was to keep PyIceberg
> pure Python. At some point, we added a Cython Avro decoder for
> performance reasons, but we still have a pure Python fallback. Today
> you can still do metadata operations using s3fs without any native
> code. Once you start reading/writing data, you'll need to pull in
> PyArrow.
>
> I think it would be a win-win situation where we can combine the
> momentum of PyIceberg and push the things that Python doesn't do well
> (mostly multithreading and heavy lifting) to Iceberg-Rust, and also
> have Iceberg-Rust as a library for Rust query engines to interact
> with Iceberg tables directly.
>
> My main question here is about sequencing. Would pushing the bucket
> transforms into Rust be a good exercise to get the scaffolding in
> place?
>
> Kind regards,
> Fokko
>
> On Fri, Aug 2, 2024 at 16:38, Renjie Liu <liurenjie2...@gmail.com> wrote:
>
>> Hi:
>>
>> Thanks Sung for raising this. Just as Ryan said, I'm also +1 for
>> using more iceberg-rust in pyiceberg.
>>
>>> Is it that we would introduce a hard dependency on iceberg-rust?
>>
>> I think so.
>>
>>> Is that a risk that could make PyIceberg unusable for some people?
>>> I don't think that would be a problem since we already have a
>>> requirement for pyarrow for these cases.
>>
>> +1.
>>
>> In fact, what I think about more is making iceberg-rust the backend
>> of pyiceberg; this is what Xuanwo, Fokko, and I have talked about in
>> brainstorming sessions several times. Not only the bucket transform,
>> but also the FileIO in another thread initiated by Xuanwo. I think
>> these are all intermediate steps toward our final goal.
>>
>> On Fri, Aug 2, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>>
>>> In general, I think the idea of using iceberg-rust more from
>>> PyIceberg is great. I think it will be a good path to pushing more
>>> things down to native code.
>>>
>>> What are the trade-offs of doing it this way? Is it that we would
>>> introduce a hard dependency on iceberg-rust? Is that a risk that
>>> could make PyIceberg unusable for some people? I don't think that
>>> would be a problem since we already have a requirement for pyarrow
>>> for these cases.
>>>
>>> On Thu, Aug 1, 2024 at 6:39 AM Sung Yun <sungwy...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> This is something I've been mulling over for a while, and I
>>>> thought this would be the right forum to discuss it as a follow-up
>>>> to a similar discussion thread on using Python bindings from
>>>> iceberg-rust to support PyIceberg.
>>>>
>>>> As soon as we released 0.7.0, which supports writes into tables
>>>> with TimeTransform partitions
>>>> <https://github.com/apache/iceberg-python/pull/784/files>, our
>>>> prospective users started asking about support for Bucket
>>>> Transform partitions.
>>>>
>>>> Iceberg has custom logic for Bucket partitions (thanks for the
>>>> link <https://iceberg.apache.org/spec/#bucket-transform-details>
>>>> Fokko). I took a look at the Java code
>>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L99>
>>>> and I think it looks somewhat like:
>>>>
>>>> * mmh3_hash(val) mod (num_buckets)
>>>>
>>>> with field-type-specific logic so that each type is hashed
>>>> appropriately.
>>>>
>>>> Unfortunately, there is no existing pyarrow compute function that
>>>> does this, so I'd like to propose that we write a function in
>>>> iceberg-rust that takes an Arrow Array reference and the number of
>>>> buckets as input, and returns a new Arrow Array with the evaluated
>>>> bucket values corresponding to the input Arrow Array, in the same
>>>> order.
>>>>
>>>> When iceberg-rust becomes more mature, I believe the same
>>>> underlying transform function can be reused for bucket partitions
>>>> within this repository, and in the interim we could support writes
>>>> into Bucket-partitioned tables in PyIceberg by exposing this
>>>> function as a Python binding that we import into PyIceberg.
>>>>
>>>> I'd love to hear how folks feel about this idea!
>>>>
>>>> Cross-posted discussion on iceberg-rust: #514
>>>> <https://github.com/apache/iceberg-rust/discussions/514>
>>>>
>>>> Sung
>>>
>>> --
>>> Ryan Blue
>>> Databricks
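P.S. A minimal Python sketch of the RustTransform hierarchy suggested at
the top of this reply might look like the following. All class names are
illustrative, and the call into the iceberg-rust binding is a
hypothetical placeholder, not an actual PyIceberg or iceberg-rust API:

```python
from abc import ABC, abstractmethod
from typing import Any, List


class Transform(ABC):
    """Base class for partition transforms (stand-in for pyiceberg's Transform)."""

    @abstractmethod
    def transform(self, values: List[Any]) -> List[Any]:
        """Apply the transform to a batch of values."""


class RustTransform(Transform):
    """Common base for transforms whose heavy lifting would be delegated
    to iceberg-rust through a Python binding."""


class BucketRustTransform(RustTransform):
    """Bucket transform; a real implementation would hand the values
    (e.g. an Arrow Array) to iceberg-rust and get bucket ids back."""

    def __init__(self, num_buckets: int) -> None:
        self.num_buckets = num_buckets

    def transform(self, values: List[Any]) -> List[int]:
        # Hypothetical binding call, e.g.:
        #   return iceberg_rust.bucket_transform(values, self.num_buckets)
        raise NotImplementedError("delegated to the iceberg-rust binding")
```

The point of the intermediate RustTransform class is that pure-Python
transforms and Rust-backed transforms can coexist behind the same
Transform interface, so PyIceberg could fall back to pure Python where
the binding is unavailable.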
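P.P.S. The `mmh3_hash(val) mod (num_buckets)` logic Sung describes can
be sketched in pure Python for reference. Per the Iceberg spec, the
hash is 32-bit Murmur3 (x86 variant, seed 0), int/long values are
hashed as their 8-byte little-endian encoding, strings as their UTF-8
bytes, and the bucket id is the positive part of the hash modulo the
number of buckets. This is only an illustrative reference sketch with
made-up function names, not the proposed Rust implementation:

```python
import struct


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python 32-bit Murmur3 (x86 variant), returning a signed
    int32; this is the hash the Iceberg spec uses for bucketing."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF
    h = seed & mask
    length = len(data)
    # Body: process 4-byte little-endian blocks.
    rounded = length - (length % 4)
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i : i + 4], "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotl32(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = ((h << 13) | (h >> 19)) & mask  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & mask
    # Tail: 1-3 remaining bytes.
    k = 0
    tail = data[rounded:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        k = (k * c2) & mask
        h ^= k
    # Finalization mix.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h - (1 << 32) if h >= (1 << 31) else h


def bucket_int(value: int, num_buckets: int) -> int:
    """Spec rule for int/long: hash the 8-byte little-endian encoding,
    keep the positive part, and reduce modulo the number of buckets."""
    h = murmur3_x86_32(struct.pack("<q", value))
    return (h & 0x7FFFFFFF) % num_buckets


def bucket_str(value: str, num_buckets: int) -> int:
    """Spec rule for strings: hash the UTF-8 bytes directly."""
    h = murmur3_x86_32(value.encode("utf-8"))
    return (h & 0x7FFFFFFF) % num_buckets
```

For example, the spec's test vector hashes the int 34 to 2017239379, so
bucket_int(34, 16) gives 2017239379 % 16 = 3. The proposed iceberg-rust
function would apply the same per-type hashing vectorized over an Arrow
Array instead of one Python value at a time.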