Hi everyone,

The Python SDK has a dependency on pyarrow [1], currently used only by ParquetIO for its Parquet reader and writer (a minimal sketch of that usage follows below).

The Arrow project recently hit a major milestone with their 1.0 release. They now make forward- and backward-compatibility guarantees for the IPC format, which is very exciting and useful! But they're not making similar guarantees for releases of the Arrow libraries: they intend for regular library releases (targeting a three-month cadence) to be major version bumps, with possible breaking API changes [2].
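For context, here is roughly how that dependency surfaces for users today: ParquetIO takes a pyarrow schema directly, so a user's own pyarrow and the version Beam pins have to coexist in one environment. The file path and field names below are just placeholders.

import pyarrow as pa
import apache_beam as beam

# ParquetIO accepts a pyarrow schema, so Beam and user code share a
# single pyarrow installation; their version constraints must overlap.
schema = pa.schema([('name', pa.string()), ('age', pa.int64())])

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{'name': 'a', 'age': 1}, {'name': 'b', 'age': 2}])
        | beam.io.WriteToParquet('/tmp/example', schema))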
If we only support a single major version of pyarrow, as we do for other Python dependencies, this could present quite a challenge for any Beam users that also have their own pyarrow dependency. If Beam keeps up with the latest Arrow release, they'd have to upgrade pyarrow in lockstep with Beam. Worse, if Beam *doesn't* keep its dependency up to date, our users might be locked out of new features in pyarrow.

To alleviate this, I think we should maintain support for multiple major pyarrow versions, and make an effort to keep up with new Arrow releases. I've verified that every major release, from our current lower bound of 0.15.1 up through the latest 2.x release, works with the current ParquetIO code*. So this should just be a matter of:

1) Expanding the bounds in setup.py (a sketch of the widened requirement follows the references)
2) Adding test suites that run the ParquetIO tests against the older versions to catch any regressions (in an offline discussion, +Udi Meiri <[email protected]> volunteered to help out with this)

I went ahead and created BEAM-11211 to track this, but please let me know if there are any objections or concerns.

Brian

* There's actually a small regression in 1.x only: it can't write with LZ4 compression. But this can be easily detected at pipeline construction time (a sketch of such a check is at the very end of this message).

[1] https://github.com/apache/beam/blob/d2980d9346f3c9180da6218cc2cfafe801a4c4fb/sdks/python/setup.py#L150
[2] https://arrow.apache.org/docs/format/Versioning.html
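For step 1, the widened requirement might look something like the following. This is illustrative only, not the actual diff; the exact upper bound would track whatever major releases we've verified:

# Illustrative only: widen the pyarrow bounds in sdks/python/setup.py
# to span every major version verified to work with ParquetIO,
# e.g. 0.15.1 through the latest 2.x release.
REQUIRED_PACKAGES = [
    # ...
    'pyarrow>=0.15.1,<3.0.0',
    # ...
]

For step 2, each supported major version could then be pinned in its own test environment, so that a regression against an older pyarrow shows up in CI rather than in a user's pipeline.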

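And for the LZ4 caveat, a construction-time guard could look roughly like this. The helper name and error message are hypothetical, not the actual ParquetIO code; it just assumes, per the footnote above, that the write regression is limited to the 1.x line:

import pyarrow as pa

def _check_codec(codec):
  # Hypothetical guard: pyarrow 1.x can't write LZ4-compressed Parquet,
  # so fail fast while the pipeline is being constructed rather than at
  # write time on the workers.
  if codec.lower() == 'lz4' and pa.__version__.startswith('1.'):
    raise ValueError(
        'LZ4 compression is not supported for writing Parquet files '
        'with pyarrow 1.x. Use a different codec or pyarrow version.')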