[
https://issues.apache.org/jira/browse/BEAM-11731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276705#comment-17276705
]
Brian Hulette commented on BEAM-11731:
--------------------------------------
To summarize, we seem to have two issues with numpy 1.20.0:
- Actually a pyarrow issue (ARROW-11450): this is the cause of the
test_read_write_10_parquet failure.
- numpy 1.20 deprecates some aliases for standard types
(https://numpy.org/doc/1.20/release/1.20.0-notes.html?highlight=release%20notes#using-the-aliases-of-builtin-types-like-np-int-is-deprecated).
This is the cause of the PreCommit Lint failure. We rely on these aliases in a
couple of places, most notably in beam schemas. The release notes say they are
deprecated, and not removed. Maybe this is just an issue with mypy warning us
about using a deprecated object? The fact that this isn't accompanied by any
test failures would indicate this one is not a serious problem for our users.
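For context, the deprecation looks like the sketch below; the alias names come from the numpy 1.20 release notes, and the replacements are the standard sized numpy scalar types (this is illustrative, not code from Beam):

```python
import numpy as np

# In numpy 1.20, np.int, np.float, np.bool, np.object, and np.str are
# deprecated aliases for the Python builtins. The sized numpy scalar
# types (np.int64, np.float64, np.bool_, ...) are unaffected.
arr = np.array([1, 2, 3], dtype=np.int64)  # instead of dtype=np.int
flag = np.bool_(True)                      # instead of np.bool(True)

print(arr.dtype, type(flag))
```

Code that only uses the sized types emits no DeprecationWarning on 1.20, which matches the observation that only the lint/mypy check fails.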
Actions I think we should take:
- Stop using the np.bool, np.int, etc. aliases (easy, not urgent).
- From now on, use the next _minor_ version of numpy as our upper bound (i.e.
<1.21.0) instead of the next major version. Releases aren't that frequent, and
we get a signal about these updates from the dependency check report.
- (If possible) restrict to numpy <1.20.0 when pyarrow <3.0 is used. pyarrow
3.0 works with numpy 1.20.0, but pyarrow <3.0 does not, even though its numpy
requirement allows it. If it's possible, we should make our setup.py work
around this.
- Cherry-pick <1.20.0 requirement to 2.28.0 branch.
- Update 2.27.0 blog post with a known issue with numpy 1.20.0 and parquet.
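The pinning actions above might look roughly like this in a setup.py-style install_requires list; the lower bounds here are placeholders for illustration, not Beam's actual requirements:

```python
# Hypothetical sketch of the proposed dependency pins. Exact lower
# bounds are illustrative only.
REQUIRED_PACKAGES = [
    # Cap at the next *minor* numpy release, not the next major one.
    'numpy>=1.14.3,<1.21.0',
    # pyarrow <3.0 breaks with numpy 1.20.0 (ARROW-11450) even though
    # its own numpy requirement allows it, so cap pyarrow as well.
    'pyarrow>=0.15.1,<3.0.0',
]
```

Note that pip has no way to select a numpy bound conditionally on which pyarrow version ends up installed, so capping numpy itself (as proposed for the 2.28.0 branch) is the simpler workaround.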
> numpy 1.20.0 breaks dataframe io_test.test_read_write_10_parquet and mypy
> -------------------------------------------------------------------------
>
> Key: BEAM-11731
> URL: https://issues.apache.org/jira/browse/BEAM-11731
> Project: Beam
> Issue Type: Bug
> Components: test-failures
> Reporter: Kyle Weaver
> Assignee: Brian Hulette
> Priority: P0
> Fix For: 2.28.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> pyarrow.lib.ArrowTypeError: ("Did not pass numpy.dtype object [while running
> '_WriteToPandas/WriteToFiles/ParDo(_WriteUnshardedRecordsFn)/ParDo(_WriteUnshardedRecordsFn)']",
> 'Conversion failed for column rank with type int64')
> https://ci-beam.apache.org/job/beam_PreCommit_Python_Commit/17083/testReport/junit/apache_beam.dataframe.io_test/IOTest/test_read_write_10_parquet/
--
This message was sent by Atlassian Jira
(v8.3.4#803005)