On Mon, Nov 9, 2020 at 3:37 PM Brian Hulette <[email protected]> wrote:
> On Mon, Nov 9, 2020 at 2:55 PM Ahmet Altay <[email protected]> wrote:
>
>> This sounds reasonable. A few questions:
>> - Do we need to expand the test matrix every 3 months or so to add
>> support for different versions of pyarrow?
>>
>
> Yes, I think so. If there's a concern that this will become excessive, we
> might consider testing just the newest and the oldest supported versions.
>

Sounds reasonable to me. And I suppose we can exclude certain major
versions if we catch issues with them, similar to the LZ4 regression.

>> - Which pyarrow version will we ship in the default container?
>>
>
> I think we should just ship the latest supported one, since that's what
> any users without their own pyarrow dependency will use.
>

+1

>> - Related to the LZ4 regression, how did we catch this? If this is a
>> one-off that is probably fine. It would make the code less maintainable
>> over time if we need to have branching code for different pyarrow
>> versions.
>>
>
> It was caught by a unit test [1], but it's also documented in the release
> notes for arrow 1.0.0 [2].
>

It is great that it was caught by a unit test. We could easily miss things
like this in the release notes.

> [1]
> https://github.com/apache/beam/blob/96610c9c0f56a21e4e06388bb83685131b3b1c55/sdks/python/apache_beam/io/parquetio_test.py#L335
> [2] https://arrow.apache.org/blog/2020/07/24/1.0.0-release/
>
>> On Mon, Nov 9, 2020 at 2:47 PM Brian Hulette <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> The Python SDK has a dependency on pyarrow [1], currently only used by
>>> ParquetIO for its parquet reader and writer. The arrow project recently
>>> hit a major milestone with their 1.0 release. They now make forward- and
>>> backward-compatibility guarantees for the IPC format, which is very
>>> exciting and useful! But they're not making similar guarantees for
>>> releases of the arrow libraries. They intend for regular library
>>> releases (targeting a 3-month cadence) to be major version bumps, with
>>> possible breaking API changes [2].
>>>
>>> If we only support a single major version of pyarrow, as we do for other
>>> Python dependencies, this could present quite a challenge for any Beam
>>> users that also have their own pyarrow dependency. If Beam keeps up with
>>> the latest arrow release, they'd have to upgrade pyarrow in lockstep
>>> with Beam. Worse, if Beam *doesn't* keep its dependency up-to-date, our
>>> users might be locked out of new features in pyarrow.
>>>
>>> In order to alleviate this I think we should maintain support for
>>> multiple major pyarrow versions, and make an effort to keep up with new
>>> Arrow releases.
>>>
>>> I've verified that every major release going back to our current lower
>>> bound, 0.15.1, up to the latest 2.x release will work with the current
>>> ParquetIO code*. So this should just be a matter of:
>>> 1) Expanding the bounds in setup.py
>>> 2) Adding test suites to run ParquetIO tests with older versions to
>>> catch any regressions (in an offline discussion +Udi Meiri
>>> <[email protected]> volunteered to help out with this).
>>>
>>> I went ahead and created BEAM-11211 to track this, but please let me
>>> know if there are any objections or concerns.
>>>
>>> Brian
>>>
>>> * There's actually a small regression just in 1.x: it can't write with
>>> LZ4 compression, but this can be easily detected at pipeline
>>> construction time.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/d2980d9346f3c9180da6218cc2cfafe801a4c4fb/sdks/python/setup.py#L150
>>> [2] https://arrow.apache.org/docs/format/Versioning.html
>>
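As an illustration of the construction-time check mentioned in the footnote
above, a minimal sketch might look like the following. The helper name
validate_codec is hypothetical, not the actual ParquetIO code:

    # Minimal sketch of a pipeline-construction-time guard for the pyarrow
    # 1.x LZ4 write regression. validate_codec is a hypothetical helper,
    # not the actual ParquetIO implementation.
    import pyarrow as pa

    ARROW_MAJOR_VERSION = int(pa.__version__.split('.')[0])

    def validate_codec(codec):
      # pyarrow 1.x cannot write LZ4-compressed parquet files, so fail
      # fast when the transform is constructed rather than at runtime.
      if codec == 'lz4' and ARROW_MAJOR_VERSION == 1:
        raise ValueError(
            'LZ4 compression is not supported with pyarrow 1.x; use a '
            'different codec or a different pyarrow version.')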

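And for step 1 of the plan, the widened requirement in setup.py would
presumably look something like this. The bounds here are illustrative only,
based on the range verified in the thread (0.15.1 through the 2.x line);
the actual change is tracked in BEAM-11211:

    # Illustrative only: widen the pyarrow requirement in
    # sdks/python/setup.py to cover every verified major version,
    # from the current 0.15.1 lower bound through the 2.x line.
    REQUIRED_PACKAGES = [
        # ... other dependencies elided ...
        'pyarrow>=0.15.1,<3.0.0',
    ]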