On Mon, Nov 9, 2020 at 3:37 PM Brian Hulette <[email protected]> wrote:

>
>
> On Mon, Nov 9, 2020 at 2:55 PM Ahmet Altay <[email protected]> wrote:
>
>> This sounds reasonable. A few questions:
>> - Do we need to expand the test matrix every 3 months or so to add
>> support for different versions of pyarrow?
>>
>
> Yes, I think so. If there's a concern that this will become excessive, we
> might consider testing just the newest and oldest supported versions.
>

Sounds reasonable to me. And I suppose we can exclude certain major
versions if we catch issues with them, similar to the LZ4 regression.
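
For the record, the exclusion could live right in the version specifier.
A minimal sketch of what the setup.py entry might look like (list name
and bounds are illustrative, and "!=1.*" is PEP 440 wildcard exclusion
syntax):

REQUIRED_PACKAGES = [
    # ... other dependencies ...
    # Hypothetical bounds: accept 0.15.1 through 2.x, but skip a major
    # series entirely if we find a regression in it.
    'pyarrow>=0.15.1,!=1.*,<3.0.0',
]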


>
>> - Which pyarrow version will we ship in the default container?
>>
>
> I think we should just ship the latest supported one since that's what any
> users without their own pyarrow dependency will use.
>

+1


>
>> - Related to the LZ4 regression, how did we catch this? If this is a
>> one-off, that is probably fine. It would make the code less maintainable
>> over time if we need to have branching code for different pyarrow versions.
>>
>
> It was caught by a unit test [1], but it's also documented in the release
> notes for arrow 1.0.0 [2].
>

It is great that it was caught by a unit test. We could easily miss things
like this in the release notes.
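
For posterity, a guard test along these lines is roughly what I'd expect
to keep catching codec regressions in CI (a rough sketch, not the actual
test in parquetio_test.py linked as [1] below):

import io
import unittest

import pyarrow
import pyarrow.parquet as pq


class PyArrowLz4RegressionTest(unittest.TestCase):

  def test_lz4_write_round_trip(self):
    table = pyarrow.Table.from_pydict({'x': [1, 2, 3]})
    buf = io.BytesIO()
    # Under pyarrow 1.x this write raises, which is how a regression
    # like the LZ4 one surfaces in CI rather than in release notes.
    pq.write_table(table, buf, compression='LZ4')
    buf.seek(0)
    self.assertEqual(pq.read_table(buf).to_pydict(), {'x': [1, 2, 3]})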


>
> [1]
> https://github.com/apache/beam/blob/96610c9c0f56a21e4e06388bb83685131b3b1c55/sdks/python/apache_beam/io/parquetio_test.py#L335
> [2] https://arrow.apache.org/blog/2020/07/24/1.0.0-release/
>
>
>> On Mon, Nov 9, 2020 at 2:47 PM Brian Hulette <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> The Python SDK has a dependency on pyarrow [1], currently only used by
>>> ParquetIO for its parquet reader and writer. The arrow project recently hit
>>> a major milestone with their 1.0 release. They now make forward- and
>>> backward-compatibility guarantees for the IPC format, which is very
>>> exciting and useful! But they're not making similar guarantees for releases
>>> of the arrow libraries. They intend for regular library releases (targeting
>>> a 3-month cadence) to be major version bumps, with possible breaking API
>>> changes [2].
>>>
>>> If we only support a single major version of pyarrow, as we do for other
>>> Python dependencies, this could present quite a challenge for any Beam
>>> users that also have their own pyarrow dependency. If Beam keeps up with
>>> the latest arrow release, they'd have to upgrade pyarrow in lockstep with
>>> Beam. Worse, if Beam *doesn't* keep its dependency up-to-date, our users
>>> might be locked out of new features in pyarrow.
>>>
>>> In order to alleviate this I think we should maintain support for
>>> multiple major pyarrow versions, and make an effort to keep up with new
>>> Arrow releases.
>>>
>>> I've verified that every major release, from our current lower bound
>>> (0.15.1) up to the latest 2.x release, works with the current ParquetIO
>>> code*. So this should just be a matter of:
>>> 1) Expanding the bounds in setup.py
>>> 2) Adding test suites to run ParquetIO tests with older versions to
>>> catch any regressions (in an offline discussion, +Udi Meiri
>>> <[email protected]> volunteered to help out with this).
>>>
>>> I went ahead and created BEAM-11211 to track this, but please let me
>>> know if there are any objections or concerns.
>>>
>>> Brian
>>>
>>> * There's actually a small regression just in 1.x: it can't write with
>>> LZ4 compression, but this can easily be detected at pipeline construction
>>> time.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/d2980d9346f3c9180da6218cc2cfafe801a4c4fb/sdks/python/setup.py#L150
>>> [2] https://arrow.apache.org/docs/format/Versioning.html
>>>
>>
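
One more thought on the footnote about construction-time detection: I
imagine the guard would look roughly like this (just a sketch with a
hypothetical helper name, not the actual ParquetIO code):

import pyarrow


def _check_compression_supported(codec):
  # Hypothetical construction-time check: fail fast on a known-broken
  # pyarrow/codec combination instead of erroring at execution time.
  major = int(pyarrow.__version__.split('.')[0])
  if codec == 'LZ4' and major == 1:
    raise ValueError(
        'Writing with LZ4 compression is broken in pyarrow 1.x; '
        'use a different codec or pyarrow version.')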
