This sounds reasonable. A few questions:
- Do we need to expand the test matrix every 3 months or so to add support
for different versions of pyarrow?
- Which pyarrow version will we ship in the default container?
- Related to the LZ4 regression, how did we catch this? If it's a one-off,
that is probably fine, but it would make the code less maintainable over
time if we need branching code for different pyarrow versions.
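For concreteness, the kind of version-branching I'm worried about looks
something like this (a hypothetical sketch, not Beam code; the function name
is my own, and in practice the version would come from pyarrow.__version__):

```python
# Hypothetical sketch of per-pyarrow-version branching (not actual Beam code).
# In a real pipeline the version string would come from pyarrow.__version__.

def validate_compression(codec, pyarrow_version):
    """Fail fast at pipeline construction time for codecs a given
    pyarrow release can't write."""
    major = int(pyarrow_version.split(".")[0])
    # Per the note below: pyarrow 1.x can't write LZ4-compressed parquet.
    if codec == "lz4" and major == 1:
        raise ValueError(
            "LZ4 parquet compression is not supported with pyarrow 1.x; "
            "choose a different codec or pyarrow version.")
```

Each new pyarrow major version risks adding another branch like this, which
is the maintenance cost I'd like to understand.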

On Mon, Nov 9, 2020 at 2:47 PM Brian Hulette <[email protected]> wrote:

> Hi everyone,
>
> The Python SDK has a dependency on pyarrow [1], currently only used by
> ParquetIO for its parquet reader and writer. The arrow project recently hit
> a major milestone with their 1.0 release. They now make forward- and
> backward-compatibility guarantees for the IPC format, which is very
> exciting and useful! But they're not making similar guarantees for releases
> of the arrow libraries. They intend for regular library releases (targeting
> a 3 month cadence) to be major version bumps, with possible breaking API
> changes [2].
>
> If we only support a single major version of pyarrow, as we do for other
> Python dependencies, this could present quite a challenge for any Beam
> users that also have their own pyarrow dependency. If Beam keeps up with
> the latest arrow release, they'd have to upgrade pyarrow in lockstep with
> Beam. Worse, if Beam *doesn't* keep its dependency up-to-date, our users
> might be locked out of new features in pyarrow.
>
> In order to alleviate this I think we should maintain support for multiple
> major pyarrow versions, and make an effort to keep up with new Arrow
> releases.
>
> I've verified that every major release going back to our current lower
> bound, 0.15.1, up to the latest 2.x release will work with the current
> ParquetIO code*. So this should just be a matter of:
> 1) Expanding the bounds in setup.py
> 2) Adding test suites to run ParquetIO tests with older versions to catch
> any regressions (in an offline discussion, +Udi Meiri <[email protected]>
> volunteered to help out with this).
>
> I went ahead and created BEAM-11211 to track this, but please let me know
> if there are any objections or concerns.
>
> Brian
>
> * There's actually a small regression just in 1.x: it can't write with LZ4
> compression, but this can be easily detected at pipeline construction time.
>
> [1]
> https://github.com/apache/beam/blob/d2980d9346f3c9180da6218cc2cfafe801a4c4fb/sdks/python/setup.py#L150
> [2] https://arrow.apache.org/docs/format/Versioning.html
>
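For concreteness, the bound expansion in step 1 would presumably be a
one-line change to the pyarrow entry in setup.py, roughly along these lines
(the specific bounds shown here are illustrative, not the real ones):

```python
# Illustrative sketch of the setup.py dependency change (bounds are examples,
# not the actual values in Beam's setup.py).
REQUIRED_PACKAGES = [
    # before: a single supported major version, e.g.
    #   'pyarrow>=0.15.1,<0.16.0',
    # after: spanning every major version verified to work with ParquetIO:
    'pyarrow>=0.15.1,<3.0.0',
]
```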
