On Mon, Nov 9, 2020 at 2:55 PM Ahmet Altay <[email protected]> wrote:

> This sounds reasonable. A few questions:
> - Do we need to expand the test matrix every 3 months or so to add support
> for different versions of pyarrow?
>

Yes, I think so. If there's a concern that this will become excessive, we
might consider testing just the newest and oldest supported versions.
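
For example (purely illustrative; the extra names and exact pins are
hypothetical), we could add test extras to setup.py that CI uses to install
the oldest and newest supported releases:

    # setup.py (excerpt, illustrative)
    extras_require={
        'pyarrow-oldest': ['pyarrow==0.15.1'],
        'pyarrow-latest': ['pyarrow>=2.0.0,<3.0.0'],
    },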

> - Which pyarrow version will we ship in the default container?
>

I think we should just ship the latest supported version, since that's what
any users without their own pyarrow dependency will use.

> - Related to the LZ4 regression, how did we catch this? If this is a one-off
> that is probably fine. It would make things less maintainable over time if
> we need to have branching code for different pyarrow versions.
>

It was caught by a unit test [1], but it's also documented in the release
notes for Arrow 1.0.0 [2].

[1]
https://github.com/apache/beam/blob/96610c9c0f56a21e4e06388bb83685131b3b1c55/sdks/python/apache_beam/io/parquetio_test.py#L335
[2] https://arrow.apache.org/blog/2020/07/24/1.0.0-release/
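
As a reference point, here is a minimal, self-contained sketch (not the actual
Beam test) of the kind of version gate involved: the write test is skipped on
pyarrow 1.x, where LZ4-compressed Parquet can't be written:

    import tempfile
    import unittest

    import pyarrow as pa
    import pyarrow.parquet as pq

    ARROW_MAJOR = int(pa.__version__.split('.')[0])

    class Lz4WriteTest(unittest.TestCase):

      @unittest.skipIf(ARROW_MAJOR == 1, 'LZ4 write is broken in pyarrow 1.x')
      def test_lz4_roundtrip(self):
        table = pa.table({'x': [1, 2, 3]})
        with tempfile.TemporaryDirectory() as d:
          path = d + '/test.parquet'
          pq.write_table(table, path, compression='lz4')
          self.assertTrue(pq.read_table(path).equals(table))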


> On Mon, Nov 9, 2020 at 2:47 PM Brian Hulette <[email protected]> wrote:
>
>> Hi everyone,
>>
>> The Python SDK has a dependency on pyarrow [1], currently only used by
>> ParquetIO for its Parquet reader and writer. The Arrow project recently hit
>> a major milestone with their 1.0 release. They now make forward- and
>> backward-compatibility guarantees for the IPC format, which is very
>> exciting and useful! But they're not making similar guarantees for releases
>> of the arrow libraries. They intend for regular library releases (targeting
>> a 3-month cadence) to be major version bumps, with possible breaking API
>> changes [2].
>>
>> If we only support a single major version of pyarrow, as we do for other
>> Python dependencies, this could present quite a challenge for any Beam
>> users who also have their own pyarrow dependency. If Beam keeps up with
>> the latest Arrow release, they'd have to upgrade pyarrow in lockstep with
>> Beam. Worse, if Beam *doesn't* keep its dependency up to date, our users
>> might be locked out of new features in pyarrow.
>>
>> In order to alleviate this, I think we should maintain support for
>> multiple major pyarrow versions, and make an effort to keep up with new
>> Arrow releases.
>>
>> I've verified that every major release from our current lower bound, 0.15.1,
>> up to the latest 2.x release works with the current ParquetIO code*. So this
>> should just be a matter of:
>> 1) Expanding the bounds in setup.py (a sketch follows below)
>> 2) Adding test suites that run the ParquetIO tests against older versions to
>> catch any regressions (in an offline discussion +Udi Meiri
>> <[email protected]> volunteered to help out with this).
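>>
>> For (1), a minimal sketch of what the expanded requirement in setup.py could
>> look like (the exact upper bound here is illustrative, covering 0.15.1
>> through the latest 2.x release):
>>
>>     'pyarrow>=0.15.1,<3.0.0',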
>>
>> I went ahead and created BEAM-11211 to track this, but please let me know
>> if there are any objections or concerns.
>>
>> Brian
>>
>> * There's actually a small regression just in 1.x: it can't write with
>> LZ4 compression. But this can easily be detected at pipeline construction
>> time.
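>>
>> A rough sketch of that construction-time check (names here are hypothetical,
>> not the actual ParquetIO code):
>>
>>     import pyarrow as pa
>>
>>     def _validate_codec(codec):
>>       # pyarrow 1.x cannot write LZ4-compressed Parquet, so fail fast when
>>       # the pipeline is constructed instead of at execution time.
>>       if codec.lower() == 'lz4' and pa.__version__.startswith('1.'):
>>         raise ValueError(
>>             'LZ4 compression for writing requires pyarrow < 1.0 or >= 2.0')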
>>
>> [1]
>> https://github.com/apache/beam/blob/d2980d9346f3c9180da6218cc2cfafe801a4c4fb/sdks/python/setup.py#L150
>> [2] https://arrow.apache.org/docs/format/Versioning.html
>>
>
