On Mon, Nov 9, 2020 at 2:55 PM Ahmet Altay <[email protected]> wrote:

> This sounds reasonable. A few questions:
>
> - Do we need to expand the test matrix every 3 months or so to add
> support for different versions of pyarrow?
Yes, I think so. If there's a concern that this will become excessive, we
might consider testing only the newest and the oldest supported versions.

> - Which pyarrow version will we ship in the default container?

I think we should just ship the latest supported one, since that's what any
users without their own pyarrow dependency will use.

> - Related to the LZ4 regression, how did we catch this? If this is a
> one-off that is probably fine. It would make this less maintainable over
> time if we need to have branching code for different pyarrow versions.

It was caught by a unit test [1], but it's also documented in the release
notes for Arrow 1.0.0 [2].

[1] https://github.com/apache/beam/blob/96610c9c0f56a21e4e06388bb83685131b3b1c55/sdks/python/apache_beam/io/parquetio_test.py#L335
[2] https://arrow.apache.org/blog/2020/07/24/1.0.0-release/

> On Mon, Nov 9, 2020 at 2:47 PM Brian Hulette <[email protected]> wrote:
>
>> Hi everyone,
>>
>> The Python SDK has a dependency on pyarrow [1], currently only used by
>> ParquetIO for its parquet reader and writer. The Arrow project recently
>> hit a major milestone with their 1.0 release. They now make forward- and
>> backward-compatibility guarantees for the IPC format, which is very
>> exciting and useful! But they're not making similar guarantees for
>> releases of the arrow libraries. They intend for regular library
>> releases (targeting a 3-month cadence) to be major version bumps, with
>> possible breaking API changes [2].
>>
>> If we only support a single major version of pyarrow, as we do for other
>> Python dependencies, this could present quite a challenge for any Beam
>> users that also have their own pyarrow dependency. If Beam keeps up with
>> the latest Arrow release, they'd have to upgrade pyarrow in lockstep
>> with Beam. Worse, if Beam *doesn't* keep its dependency up-to-date, our
>> users might be locked out of new features in pyarrow.
>> In order to alleviate this I think we should maintain support for
>> multiple major pyarrow versions, and make an effort to keep up with new
>> Arrow releases.
>>
>> I've verified that every major release going back to our current lower
>> bound, 0.15.1, up to the latest 2.x release will work with the current
>> ParquetIO code*. So this should just be a matter of:
>> 1) Expanding the bounds in setup.py
>> 2) Adding test suites to run ParquetIO tests with older versions to
>> catch any regressions (in an offline discussion +Udi Meiri
>> <[email protected]> volunteered to help out with this).
>>
>> I went ahead and created BEAM-11211 to track this, but please let me
>> know if there are any objections or concerns.
>>
>> Brian
>>
>> * There's actually a small regression just in 1.x: it can't write with
>> LZ4 compression, but this can be easily detected at pipeline
>> construction time.
>>
>> [1] https://github.com/apache/beam/blob/d2980d9346f3c9180da6218cc2cfafe801a4c4fb/sdks/python/setup.py#L150
>> [2] https://arrow.apache.org/docs/format/Versioning.html
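[Editor's note] The construction-time detection of the LZ4 regression mentioned above could look roughly like the following. This is a hedged sketch, not Beam's actual code: the function name `validate_compression` and the version-parsing logic are illustrative only.

```python
def validate_compression(codec, pyarrow_version):
    """Reject LZ4 parquet writes under pyarrow 1.x (hypothetical helper).

    pyarrow 1.x regressed LZ4 compression for parquet writes, so a
    pipeline using LZ4 can fail fast at construction time instead of
    failing later at execution time.
    """
    major = int(pyarrow_version.split('.')[0])
    if codec.upper() == 'LZ4' and major == 1:
        raise ValueError(
            'LZ4 parquet compression is not supported by pyarrow 1.x; '
            'use a different codec or another pyarrow major version.')
```

A ParquetIO sink could call this with the installed `pyarrow.__version__` before the pipeline runs.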
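[Editor's note] Step 1 above, expanding the bounds in setup.py, amounts to widening the pyarrow version specifier so it spans several major releases instead of one. The sketch below shows a hypothetical widened requirement string and a tiny stand-in for the comparison pip performs against it; the real bounds live in sdks/python/setup.py, and the helper here is purely illustrative.

```python
# Hypothetical widened requirement: every major release from the current
# lower bound up to (but not including) the next untested major version.
PYARROW_REQUIREMENT = 'pyarrow>=0.15.1,<3.0.0'


def _vtuple(version):
    # Compare only the numeric major.minor.patch components.
    return tuple(int(p) for p in version.split('.')[:3])


def satisfies_bounds(version, lower='0.15.1', upper='3.0.0'):
    """Return True if `version` falls inside the half-open bound."""
    return _vtuple(lower) <= _vtuple(version) < _vtuple(upper)
```

With bounds like these, a user pinning pyarrow 1.x or 2.x can coexist with Beam, while the upper bound is bumped once each new Arrow major release has been tested.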
