damccorm opened a new pull request, #36495: URL: https://github.com/apache/beam/pull/36495
This downgrades the determinism requirement for GBEK coders from an error to a warning. This matches what GBK does today, which matters because users should be able to add the `--gbek` pipeline option and have their pipelines just work. Today, some of our built-in Beam transforms fail when the check is left as an error.

For example, without this change, [testDataframeSum](https://github.com/apache/beam/blob/d54a661f47e87c894f84a7cf63fac03bae6f3ec3/sdks/java/extensions/python/src/test/java/org/apache/beam/sdk/extensions/python/transforms/DataframeTransformTest.java#L37) fails with:

```
java.lang.RuntimeException: Traceback (most recent call last):
  File "apache_beam/coders/coder_impl.py", line 540, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_special_deterministic
  File "apache_beam/coders/coder_impl.py", line 460, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 481, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 544, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_special_deterministic
TypeError: Unable to deterministically encode 'BlockManager Items: Index(['b'], dtype='object') Axis 1: Index([100], dtype='int64', name='a') NumpyBlock: slice(0, 1, 1), 1 x 1, dtype: int32' of type '<class 'pandas.core.internals.managers.BlockManager'>', please provide a type hint for the input of 'GroupByEncryptedKey Group by encrypted keyThe key coder is not deterministic. This may result in incorrect pipeline output. This can be fixed by adding a type hint to the operation preceding the GroupByKey step, and for custom key classes, by writing a deterministic custom Coder. Please see the documentation for more details.'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1498, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 684, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1673, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "/usr/local/lib/python3.13/site-packages/apache_beam/transforms/util.py", line 444, in process
    encoded_value = self.value_coder.encode(v)
  File "/usr/local/lib/python3.13/site-packages/apache_beam/coders/coders.py", line 459, in encode
    return self.get_impl().encode(value)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "apache_beam/coders/coder_impl.py", line 237, in apache_beam.coders.coder_impl.StreamCoderImpl.encode
  File "apache_beam/coders/coder_impl.py", line 240, in apache_beam.coders.coder_impl.StreamCoderImpl.encode
  File "apache_beam/coders/coder_impl.py", line 1120, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 481, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 542, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_special_deterministic
TypeError: Unable to deterministically encode ' b a 100 3' of type '<class 'pandas.core.frame.DataFrame'>', please provide a type hint for the input of 'GroupByEncryptedKey Group by encrypted keyThe key coder is not deterministic. This may result in incorrect pipeline output. This can be fixed by adding a type hint to the operation preceding the GroupByKey step, and for custom key classes, by writing a deterministic custom Coder. Please see the documentation for more details.'

During handling of the above exception, another exception occurred:
```

I'd assume other dataframe tests fail similarly.
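To illustrate the behavioral difference described above, here is a minimal, stdlib-only sketch of an error-versus-warning determinism check. It is not the code in this PR and not Beam's implementation: `NonDeterministicCoder`, `check_key_coder`, and the `strict` flag are hypothetical names invented for the example. Only the `is_deterministic()` method name is modeled on the real `apache_beam.coders.Coder` interface.

```python
import warnings


class NonDeterministicCoder:
    """Hypothetical stand-in for a coder whose encoding may vary between runs."""

    def is_deterministic(self):
        return False


def check_key_coder(coder, step_label, strict=False):
    """Check a coder for determinism before grouping (illustrative only).

    strict=True raises on a non-deterministic coder, like the behavior this
    PR moves away from. strict=False only warns and returns the coder
    unchanged, like the GBK-style behavior this PR adopts.
    """
    if coder.is_deterministic():
        return coder
    message = (
        f"The key coder for {step_label!r} is not deterministic. "
        "This may result in incorrect pipeline output.")
    if strict:
        raise ValueError(message)
    warnings.warn(message, RuntimeWarning)
    return coder
```

With `strict=False`, a pipeline that uses a non-deterministic coder keeps running and the user still sees the warning. With `strict=True`, the same input stops the pipeline, which is the failure mode shown in the traceback.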
-- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
