Hi everyone,

The Python SDK has a dependency on pyarrow [1], currently used only by
ParquetIO for its Parquet reader and writer. The Arrow project recently hit
a major milestone with its 1.0 release. It now makes forward- and
backward-compatibility guarantees for the IPC format, which is very
exciting and useful! But it makes no similar guarantees for releases of
the Arrow libraries themselves: regular library releases (targeting a
3-month cadence) are intended to be major version bumps, with possible
breaking API changes [2].
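
For context, pyarrow isn't just an internal detail here: ParquetIO's
public API takes pyarrow types directly, so ParquetIO users touch pyarrow
themselves. A minimal write, sketched against the current API (the path
prefix, field names, and data are illustrative):

import pyarrow
import apache_beam as beam

# The schema passed to WriteToParquet is a pyarrow.Schema, so any
# ParquetIO user necessarily imports pyarrow alongside Beam.
with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{'name': 'a', 'year': 2020}])
        | beam.io.WriteToParquet(
            'out/records',  # illustrative output path prefix
            pyarrow.schema([('name', pyarrow.string()),
                            ('year', pyarrow.int64())])))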

If we only support a single major version of pyarrow, as we do for other
Python dependencies, this could present quite a challenge for any Beam
users who also have their own pyarrow dependency. If Beam keeps up with
the latest Arrow release, those users would have to upgrade pyarrow in
lockstep with Beam. Worse, if Beam *doesn't* keep its dependency up to
date, our users might be locked out of new pyarrow features.

To alleviate this, I think we should maintain support for multiple major
pyarrow versions and make an effort to keep up with new Arrow releases.

I've verified that every major release, from our current lower bound of
0.15.1 up to the latest 2.x release, works with the current ParquetIO
code*. So this should just be a matter of:
1) Expanding the version bounds in setup.py (sketched below, after this
list)
2) Adding test suites that run the ParquetIO tests against older pyarrow
versions to catch any regressions (in an offline discussion, +Udi Meiri
<[email protected]> volunteered to help out with this)
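
Concretely, (1) might look like the following in sdks/python/setup.py.
This is just a sketch: the exact upper bound would track whatever release
we've verified, currently the latest 2.x.

# Hypothetical widened pyarrow bound: keep the verified lower bound and
# accept everything below the next, not-yet-verified major release.
REQUIRED_PACKAGES = [
    # ... other dependencies ...
    'pyarrow>=0.15.1,<3.0.0',
]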

I went ahead and created BEAM-11211 to track this, but please let me know
if there are any objections or concerns.

Brian

* There's actually one small regression, only in 1.x: it can't write with
LZ4 compression. But this can easily be detected at pipeline construction
time, as sketched below.
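
As a sketch of that check (the names here are illustrative, not the
actual ParquetIO internals), it only needs the requested codec and the
installed pyarrow major version:

import pyarrow as pa

_PYARROW_MAJOR = int(pa.__version__.split('.')[0])

def _validate_codec(codec):
    # Illustrative construction-time guard: pyarrow 1.x can't write
    # LZ4-compressed Parquet files, so reject the combination while the
    # pipeline is being built instead of failing later on the workers.
    if codec == 'lz4' and _PYARROW_MAJOR == 1:
        raise ValueError(
            'LZ4 compression is not supported for writing with pyarrow '
            '1.x; please use a different codec or pyarrow version.')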

[1] https://github.com/apache/beam/blob/d2980d9346f3c9180da6218cc2cfafe801a4c4fb/sdks/python/setup.py#L150
[2] https://arrow.apache.org/docs/format/Versioning.html
