This sounds reasonable. A few questions:

- Do we need to expand the test matrix every 3 months or so to add support for different versions of pyarrow?
- Which pyarrow version will we ship in the default container?
- Related to the LZ4 regression, how did we catch this? If this is a one-off, that is probably fine, but it would make the code less maintainable over time if we need branching code for different pyarrow versions.
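To make the last point concrete, here's a minimal sketch of the kind of construction-time check (and version branching) I have in mind. The function names are hypothetical, not Beam's actual API, and the version gate just encodes "broken only in the 1.x series" as described below:

```python
# Hypothetical sketch: gate LZ4 Parquet writes on the installed pyarrow
# version at pipeline construction time. Names here are illustrative,
# not part of Beam's actual API.

def lz4_write_supported(pyarrow_version: str) -> bool:
    """The LZ4 write regression affects only the pyarrow 1.x series."""
    major = int(pyarrow_version.split(".")[0])
    return major != 1

def check_compression(codec: str, pyarrow_version: str) -> None:
    """Fail at pipeline construction time rather than mid-pipeline."""
    if codec.upper() == "LZ4" and not lz4_write_supported(pyarrow_version):
        raise ValueError(
            f"LZ4 Parquet writes are broken in pyarrow {pyarrow_version}; "
            "use a different codec or a pyarrow release outside 1.x."
        )

check_compression("SNAPPY", "1.0.1")  # fine: regression is LZ4-specific
check_compression("LZ4", "2.0.0")     # fine: 2.x is unaffected
# check_compression("LZ4", "1.0.1")   # would raise ValueError
```

Even this small example shows the maintenance concern: every version-specific quirk adds another branch we have to carry until the affected versions fall out of support.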
On Mon, Nov 9, 2020 at 2:47 PM Brian Hulette <[email protected]> wrote:
> Hi everyone,
>
> The Python SDK has a dependency on pyarrow [1], currently only used by
> ParquetIO for its parquet reader and writer. The arrow project recently hit
> a major milestone with their 1.0 release. They now make forward- and
> backward-compatibility guarantees for the IPC format, which is very
> exciting and useful! But they're not making similar guarantees for releases
> of the arrow libraries. They intend for regular library releases (targeting
> a 3 month cadence) to be major version bumps, with possible breaking API
> changes [2].
>
> If we only support a single major version of pyarrow, as we do for other
> Python dependencies, this could present quite a challenge for any Beam
> users that also have their own pyarrow dependency. If Beam keeps up with
> the latest arrow release, they'd have to upgrade pyarrow in lockstep with
> Beam. Worse, if Beam *doesn't* keep its dependency up-to-date, our users
> might be locked out of new features in pyarrow.
>
> In order to alleviate this I think we should maintain support for multiple
> major pyarrow versions, and make an effort to keep up with new Arrow
> releases.
>
> I've verified that every major release going back to our current lower
> bound, 0.15.1, up to the latest 2.x release will work with the current
> ParquetIO code*. So this should just be a matter of:
> 1) Expanding the bounds in setup.py
> 2) Adding test suites to run ParquetIO tests with older versions to catch
> any regressions (in an offline discussion +Udi Meiri <[email protected]>
> volunteered to help out with this).
>
> I went ahead and created BEAM-11211 to track this, but please let me know
> if there are any objections or concerns.
>
> Brian
>
> * There's actually a small regression just in 1.x: it can't write with LZ4
> compression, but this can be easily detected at pipeline construction time.
>
> [1] https://github.com/apache/beam/blob/d2980d9346f3c9180da6218cc2cfafe801a4c4fb/sdks/python/setup.py#L150
> [2] https://arrow.apache.org/docs/format/Versioning.html
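For step (1) in the quoted proposal, widening the bounds in setup.py to cover 0.15.1 through the 2.x series might look like the fragment below. The exact upper bound and the list name are assumptions for illustration, not the values Beam actually settled on:

```python
# Hypothetical widened pyarrow requirement for sdks/python/setup.py.
# The <3.0.0 cap is an assumption (supports 0.15.1 through 2.x); the
# real bound should track whatever versions the test matrix covers.
REQUIRED_PACKAGES = [
    # ... other dependencies ...
    'pyarrow>=0.15.1,<3.0.0',
]
```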
