Hey everyone,

When PyIceberg started, one of the goals was to keep it pure Python. At
some point we added a Cython Avro decoder for performance reasons, but we
still have a pure Python fallback. Today you can still do metadata
operations using s3fs without any native code. Once you start reading or
writing data, you'll need to pull in PyArrow.

I think it would be a win-win situation where we can combine the momentum
of PyIceberg and push the things that Python doesn't do well (mostly
multithreading and heavy lifting) to Iceberg-Rust, and also have
Iceberg-Rust as a library for Rust query engines to interact with Iceberg
tables directly.

My main question here is about sequencing. Would pushing the bucket
transforms into Rust be a good exercise to get the scaffolding in place?

Kind regards,
Fokko


On Fri, Aug 2, 2024 at 4:38 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Hi:
>
> Thanks Sung for raising this. Just as Ryan said, I'm also +1 for using
> more iceberg-rust in pyiceberg.
>
>> Is it that we would introduce a hard dependency on iceberg-rust?
>
>
> I think so.
>
>> Is that a risk that could make PyIceberg unusable for some people? I don't
>> think that would be a problem since we already have a requirement for
>> pyarrow for these cases.
>
>
> +1.
>
> In fact, what I have in mind is making iceberg-rust the backend of
> pyiceberg, which xuanwo, Fokko and I have discussed in brainstorming
> sessions several times. That covers more than the bucket transform, and
> more than the FileIO work in the other thread initiated by xuanwo. I think
> these are all intermediate steps towards our final goal.
>
>
> On Fri, Aug 2, 2024 at 3:21 AM Ryan Blue <b...@databricks.com.invalid>
> wrote:
>
>> In general, I think the idea of using iceberg-rust more from PyIceberg is
>> great. I think it will be a good path to pushing more things down to native
>> code.
>>
>> What are the trade-offs of doing it this way? Is it that we would
>> introduce a hard dependency on iceberg-rust? Is that a risk that could make
>> PyIceberg unusable for some people? I don't think that would be a problem
>> since we already have a requirement for pyarrow for these cases.
>>
>> On Thu, Aug 1, 2024 at 6:39 AM Sung Yun <sungwy...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> This is something I've been mulling over for a while, and I thought this
>>> would be the right forum to discuss it as a follow-up to a similar
>>> discussion thread on using Python bindings from iceberg-rust to
>>> support pyiceberg.
>>>
>>> As soon as we released 0.7.0, which supports writes into tables with
>>> TimeTransform partitions
>>> <https://github.com/apache/iceberg-python/pull/784/files>, our
>>> prospective users started asking about the support for Bucket Transform
>>> partitions.
>>>
>>> Iceberg has custom logic for Bucket partitions (Thanks for the link
>>> <https://iceberg.apache.org/spec/#bucket-transform-details> Fokko). I
>>> took a look at the Java code
>>> <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L99>
>>> and I think it looks somewhat like:
>>>
>>> * mmh3_hash(val) mod (num_buckets)
>>>
>>> It also has field-type-specific logic so that each type is hashed
>>> appropriately.
>>>
>>> Unfortunately, there is no existing pyarrow compute function that does
>>> this, so I'd like to propose that we write a function in iceberg-rust
>>> that takes an Arrow Array reference and the number of buckets as input
>>> and returns a new Arrow Array reference holding the evaluated bucket
>>> values, in the same order as the input Arrow Array.
>>>
>>> When iceberg-rust becomes more mature, I believe that the same
>>> underlying transform function can be reused for bucket partitions within
>>> this repository, and in the interim we could support writes into Bucket
>>> partitioned tables on PyIceberg by exposing this function as a Python
>>> binding that we import into PyIceberg.
>>>
>>> I'd love to hear how folks feel about this idea!
>>>
>>>
>>> Cross posted Discussion on iceberg-rust: #514
>>> <https://github.com/apache/iceberg-rust/discussions/514>
>>>
>>>
>>> Sung
>>>
>>
>>
>> --
>> Ryan Blue
>> Databricks
>>
>
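For reference, the bucket computation Sung outlines in the quoted message (mmh3_hash(val) mod num_buckets, with type-specific hashing) can be sketched in pure Python. This is only an illustration under stated assumptions, not PyIceberg or iceberg-rust code: the names murmur3_x86_32, hash_long, hash_string and bucket_values are hypothetical, only the integer and string cases are covered, a plain list stands in for the Arrow Array, and masking the hash to a non-negative value before the modulo follows my reading of the bucket transform section of the Iceberg spec linked in the thread.

```python
import struct
from typing import List, Optional, Union

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant); returns an unsigned 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4

    # Body: mix each complete 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def hash_long(value: int) -> int:
    # Assumption from the spec: int and long are hashed as 8-byte little-endian longs.
    return murmur3_x86_32(struct.pack("<q", value))


def hash_string(value: str) -> int:
    # Assumption from the spec: strings are hashed over their UTF-8 bytes.
    return murmur3_x86_32(value.encode("utf-8"))


def bucket_values(
    values: List[Optional[Union[int, str]]], num_buckets: int
) -> List[Optional[int]]:
    """Return one bucket per input value, in input order; None stays None."""
    out: List[Optional[int]] = []
    for v in values:
        if v is None:
            out.append(None)
        elif isinstance(v, str):
            out.append((hash_string(v) & 0x7FFFFFFF) % num_buckets)
        elif isinstance(v, int):
            out.append((hash_long(v) & 0x7FFFFFFF) % num_buckets)
        else:
            raise TypeError(f"unsupported type for this sketch: {type(v).__name__}")
    return out
```

Calling bucket_values([34, None, "iceberg"], 16) returns a list of the same length and order as the input, which mirrors the array-in, array-out shape proposed for the Rust function. The per-value Python loop is exactly the kind of work the proposal would push down to native code.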
