Re: [PR] Bump com.gradle.common-custom-user-data-gradle-plugin from 1.12.1 to 2 [beam]
github-actions[bot] commented on PR #30865: URL: https://github.com/apache/beam/pull/30865#issuecomment-2038982164 Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Bump com.gradle.common-custom-user-data-gradle-plugin from 1.12.1 to 1.13 [beam]
dependabot[bot] closed pull request #30451: Bump com.gradle.common-custom-user-data-gradle-plugin from 1.12.1 to 1.13 URL: https://github.com/apache/beam/pull/30451
Re: [PR] Bump com.gradle.common-custom-user-data-gradle-plugin from 1.12.1 to 1.13 [beam]
dependabot[bot] commented on PR #30451: URL: https://github.com/apache/beam/pull/30451#issuecomment-2038921935 Superseded by #30865.
[PR] Bump com.gradle.common-custom-user-data-gradle-plugin from 1.12.1 to 2 [beam]
dependabot[bot] opened a new pull request, #30865: URL: https://github.com/apache/beam/pull/30865

Bumps com.gradle.common-custom-user-data-gradle-plugin from 1.12.1 to 2.

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=com.gradle.common-custom-user-data-gradle-plugin&package-manager=gradle&previous-version=1.12.1&new-version=2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Re: [PR] Tune maximum thread count for streaming dataflow worker executor dynamically. [beam]
MelodyShen commented on PR #30439: URL: https://github.com/apache/beam/pull/30439#issuecomment-2038766063 Hi @scwhittle, thanks for reviewing the changes. I have rebased onto master to catch up with the latest changes, and all checks passed. Would you mind merging the PR when you get a chance? Thanks!
Re: [PR] Create YAML Join Transform [beam]
Polber commented on PR #30734: URL: https://github.com/apache/beam/pull/30734#issuecomment-2038579469 @robertwb @itodotimothy6 FYI, this breaks on master due to
* https://github.com/apache/beam/pull/30681

I have only been able to reproduce it by using any `SqlBackedProvider` with a query that joins 2 inputs (not just this PR). I have a fix here:
* https://github.com/apache/beam/pull/30864

but since the issue doesn't get hit in any of my other repro attempts, I think there may also be a problem with how the environments get constructed using these providers (though my context on the matter is limited).
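For context, a minimal Beam YAML pipeline of the shape that triggers the failure described above (two inputs flowing into one SQL-backed transform) might look like the following sketch. The transform names, paths, and schema details here are illustrative assumptions, not taken from the PRs:

```yaml
# Hypothetical repro shape: two read transforms feeding a single Sql
# transform. The Sql transform is backed by a SqlBackedProvider, which
# is where the extra default environment gets created.
pipeline:
  transforms:
    - type: ReadFromCsv
      name: ReadA
      config:
        path: /tmp/a.csv   # illustrative path
    - type: ReadFromCsv
      name: ReadB
      config:
        path: /tmp/b.csv   # illustrative path
    - type: Sql
      name: JoinAB
      input:
        a: ReadA
        b: ReadB
      config:
        query: "SELECT a.id, b.value FROM a JOIN b ON a.id = b.id"
```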
Re: [PR] merging non-canonical environments fails [beam]
github-actions[bot] commented on PR #30864: URL: https://github.com/apache/beam/pull/30864#issuecomment-2038577539 Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Re: [PR] merging non-canonical environments fails [beam]
Polber commented on PR #30864: URL: https://github.com/apache/beam/pull/30864#issuecomment-2038576770 R: @robertwb
[PR] merging non-canonical environments fails [beam]
Polber opened a new pull request, #30864: URL: https://github.com/apache/beam/pull/30864

There are cases where environments are specified at the transform/windowing level but are not included in the top-level components' environments. This particularly shows up when running a SQL-backed provider on multiple inputs (i.e. when joining 2 PCollections) in Beam YAML, which creates 2 default environments but declares only one in the proto's components.environments section. Perhaps the real issue lies in how Beam YAML providers create transforms, but this is a useful safeguard regardless.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #` instead.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://github.com/apache/beam/blob/master/CONTRIBUTING.md#make-the-reviewers-job-easier).

To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)

GitHub Actions Tests Status (on master branch)
[![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
[![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
[![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
[![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)

See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI or the [workflows README](https://github.com/apache/beam/blob/master/.github/workflows/README.md) to see a list of phrases to trigger workflows.
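The safeguard the PR above describes can be pictured as a pass over the pipeline proto: collect every environment id referenced by a transform or windowing strategy, and register any that are missing in `components.environments`. The sketch below is an illustration of that idea only, not the actual patch; plain dicts stand in for the pipeline proto, and the field and environment names are made up:

```python
# Hedged sketch: ensure every environment id referenced by a transform or
# windowing strategy is declared in components.environments, so the runner
# doesn't fail later when it looks one up. Dicts stand in for the proto.
def ensure_environments_declared(pipeline, default_environment):
    components = pipeline["components"]
    declared = components.setdefault("environments", {})
    referenced = set()
    for transform in components.get("transforms", {}).values():
        if transform.get("environment_id"):
            referenced.add(transform["environment_id"])
    for strategy in components.get("windowing_strategies", {}).values():
        if strategy.get("environment_id"):
            referenced.add(strategy["environment_id"])
    for env_id in referenced - declared.keys():
        # Register a placeholder entry for the undeclared environment.
        declared[env_id] = default_environment
    return pipeline

pipeline = {
    "components": {
        "transforms": {
            "t1": {"environment_id": "env1"},
            "t2": {"environment_id": "env2"},  # referenced but not declared
        },
        "windowing_strategies": {"w1": {"environment_id": "env1"}},
        "environments": {"env1": {"urn": "beam:env:docker:v1"}},
    }
}
ensure_environments_declared(pipeline, {"urn": "beam:env:docker:v1"})
assert set(pipeline["components"]["environments"]) == {"env1", "env2"}
```

This mirrors the symptom described in the PR: two transforms reference two environments while only one is declared, and the pass fills in the gap instead of failing at lookup time.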
Re: [PR] [do not merge] remove --pre for testing purposes [beam]
github-actions[bot] commented on PR #30859: URL: https://github.com/apache/beam/pull/30859#issuecomment-2038487214 Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`
Re: [PR] Tune maximum thread count for streaming dataflow worker executor dynamically. [beam]
codecov-commenter commented on PR #30439: URL: https://github.com/apache/beam/pull/30439#issuecomment-2038472128

## [Codecov](https://app.codecov.io/gh/apache/beam/pull/30439?dropdown=coverage=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) Report

Attention: Patch coverage is `86.20690%`, with `8 lines` in your changes missing coverage. Please review.

> Project coverage is 62.87%. Comparing base [(`0d41168`)](https://app.codecov.io/gh/apache/beam/commit/0d41168a0963869df037e468f80f5ab8466a99ab?dropdown=coverage=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) to head [(`d499074`)](https://app.codecov.io/gh/apache/beam/pull/30439?dropdown=coverage=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache).
> Report is 1 commit behind head on master.

| [Files](https://app.codecov.io/gh/apache/beam/pull/30439?dropdown=coverage=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | Patch % | Lines |
|---|---|---|
| [...reaming/harness/StreamingWorkerStatusReporter.java](https://app.codecov.io/gh/apache/beam/pull/30439?src=pr=tree=runners%2Fgoogle-cloud-dataflow-java%2Fworker%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fbeam%2Frunners%2Fdataflow%2Fworker%2Fstreaming%2Fharness%2FStreamingWorkerStatusReporter.java_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-cnVubmVycy9nb29nbGUtY2xvdWQtZGF0YWZsb3ctamF2YS93b3JrZXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2JlYW0vcnVubmVycy9kYXRhZmxvdy93b3JrZXIvc3RyZWFtaW5nL2hhcm5lc3MvU3RyZWFtaW5nV29ya2VyU3RhdHVzUmVwb3J0ZXIuamF2YQ==) | 83.87% | [3 Missing and 2 partials :warning:](https://app.codecov.io/gh/apache/beam/pull/30439?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) |
| [...ers/dataflow/worker/util/BoundedQueueExecutor.java](https://app.codecov.io/gh/apache/beam/pull/30439?src=pr=tree=runners%2Fgoogle-cloud-dataflow-java%2Fworker%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fbeam%2Frunners%2Fdataflow%2Fworker%2Futil%2FBoundedQueueExecutor.java_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-cnVubmVycy9nb29nbGUtY2xvdWQtZGF0YWZsb3ctamF2YS93b3JrZXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2JlYW0vcnVubmVycy9kYXRhZmxvdy93b3JrZXIvdXRpbC9Cb3VuZGVkUXVldWVFeGVjdXRvci5qYXZh) | 86.95% | [2 Missing and 1 partial :warning:](https://app.codecov.io/gh/apache/beam/pull/30439?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) |

Additional details and impacted files
```diff
@@             Coverage Diff              @@
##             master   #30439       +/-  ##
============================================
+ Coverage     56.86%   62.87%    +6.00%
- Complexity     1485    14773   +13288
============================================
  Files           501     2207    +1706
  Lines         46219   153504  +107285
  Branches       1076    11735   +10659
============================================
+ Hits          26283    96511   +70228
- Misses        17918    50840   +32922
- Partials       2018     6153    +4135
```

| [Flag](https://app.codecov.io/gh/apache/beam/pull/30439/flags?src=pr=flags_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | Coverage Δ | |
|---|---|---|
| [java](https://app.codecov.io/gh/apache/beam/pull/30439/flags?src=pr=flag_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | `65.69% <86.20%> (-3.13%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/beam/pull/30439?dropdown=coverage=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache).
:loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache).
Re: [PR] [Java] ManagedIO [beam]
chamikaramj commented on PR #30808: URL: https://github.com/apache/beam/pull/30808#issuecomment-2038454155 Also, please fix spotless/lint failures.
Re: [PR] Use empty flags for default expansion service options. [beam]
robertwb merged PR #30858: URL: https://github.com/apache/beam/pull/30858
Re: [PR] [Java] ManagedIO [beam]
chamikaramj commented on code in PR #30808: URL: https://github.com/apache/beam/pull/30808#discussion_r1552590182

## sdks/java/managed/src/main/java/org/apache/beam/sdk/managed/Managed.java:
## @@ -0,0 +1,185 @@
```java
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.managed;
+
+import com.google.auto.value.AutoValue;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import javax.annotation.Nullable;
+import org.apache.beam.sdk.schemas.transforms.SchemaTransform;
+import org.apache.beam.sdk.schemas.utils.YamlUtils;
+import org.apache.beam.sdk.values.PCollectionRowTuple;
+import org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.annotations.VisibleForTesting;
+import org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions;
+import org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.collect.ImmutableMap;
+
+public class Managed {
+  public static final String ICEBERG = "iceberg";
+
+  public static Read read(String source) {
+
+    return new AutoValue_Managed_Read.Builder()
+        .setSource(
+            Preconditions.checkNotNull(
+                Read.TRANSFORMS.get(source.toLowerCase()),
+                "An unsupported source was specified: '%s'. Please specify one of the following source: %s",
+                source,
+                Read.TRANSFORMS.keySet()))
+        .setSupportedIdentifiers(new ArrayList<>(Read.TRANSFORMS.values()))
+        .build();
+  }
+
+  @AutoValue
+  public abstract static class Read extends SchemaTransform {
+    public static final Map<String, String> TRANSFORMS =
+        ImmutableMap.<String, String>builder()
+            .put(ICEBERG, "beam:schematransform:org.apache.beam:iceberg_read:v1")
+            .build();
+
+    abstract String getSource();
+
+    abstract @Nullable String getConfig();
+
+    abstract @Nullable String getConfigUrl();
+
+    abstract List<String> getSupportedIdentifiers();
+
+    abstract Builder toBuilder();
+
+    @AutoValue.Builder
+    abstract static class Builder {
+      abstract Builder setSource(String source);
+
+      abstract Builder setConfig(@Nullable String config);
+
+      abstract Builder setConfigUrl(@Nullable String configUrl);
+
+      abstract Builder setSupportedIdentifiers(List<String> supportedIdentifiers);
+
+      abstract Read build();
+    }
+
+    public Read withConfigUrl(String configUrl) {
```

Review Comment: Please add Java docs for `withConfig` methods.

## sdks/java/managed/src/main/java/org/apache/beam/sdk/managed/Managed.java:
## @@ -0,0 +1,185 @@
```java
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.managed;
+
+import com.google.auto.value.AutoValue;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import javax.annotation.Nullable;
+import org.apache.beam.sdk.schemas.transforms.SchemaTransform;
+import org.apache.beam.sdk.schemas.utils.YamlUtils;
+import org.apache.beam.sdk.values.PCollectionRowTuple;
+import org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.annotations.VisibleForTesting;
+import org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions;
+import org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.collect.ImmutableMap;
+
+public class Managed {
+  public static final String ICEBERG = "iceberg";
+
+  public static Read read(String source) {
+
+    return new AutoValue_Managed_Read.Builder()
+
```
Re: [PR] Add a lower bound on pydantic [beam]
github-actions[bot] commented on PR #30863: URL: https://github.com/apache/beam/pull/30863#issuecomment-2038448445

Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:

R: @damccorm for label python.

Available commands:
- `stop reviewer notifications` - opt out of the automated review tooling
- `remind me after tests pass` - tag the comment author after tests pass
- `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).
Re: [PR] [YAML] Interpret PDone as no outputs. [beam]
github-actions[bot] commented on PR #30862: URL: https://github.com/apache/beam/pull/30862#issuecomment-2038435261 Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Re: [PR] [YAML] Interpret PDone as no outputs. [beam]
robertwb commented on PR #30862: URL: https://github.com/apache/beam/pull/30862#issuecomment-2038434517 R: @ffernandez92
Re: [PR] [do not merge] remove --pre for testing purposes [beam]
github-actions[bot] commented on PR #30859: URL: https://github.com/apache/beam/pull/30859#issuecomment-2038414942

## Test Results

7 files  - 43 | 7 suites  - 43 | 20m 52s :stopwatch: + 12m 31s
7 573 tests +7 190 | 6 691 :white_check_mark: +6 310 | 882 :zzz: +880 | 0 :x: ±0
7 709 runs  +7 326 | 6 721 :white_check_mark: +6 340 | 988 :zzz: +986 | 0 :x: ±0

Results for commit d7cfc782. ± Comparison against base commit 0d41168a.

This pull request removes 383 and adds 7573 tests. Note that renamed tests count towards both.
```
org.apache.beam.runners.core.SimplePushbackSideInputDoFnRunnerTest ‑ testGarbageCollectForStatefulDoFnRunner
org.apache.beam.runners.core.SimplePushbackSideInputDoFnRunnerTest ‑ testLateDroppingForStatefulDoFnRunner
org.apache.beam.runners.direct.WriteWithShardingFactoryTest ‑ keyBasedOnCountFnFewElementsExtraShards
org.apache.beam.runners.direct.WriteWithShardingFactoryTest ‑ keyBasedOnCountFnFewElementsThreeShards
org.apache.beam.runners.direct.WriteWithShardingFactoryTest ‑ keyBasedOnCountFnManyElements
org.apache.beam.runners.direct.WriteWithShardingFactoryTest ‑ keyBasedOnCountFnManyElementsExtraShards
org.apache.beam.runners.direct.WriteWithShardingFactoryTest ‑ keyBasedOnCountFnWithNoElements
org.apache.beam.runners.direct.WriteWithShardingFactoryTest ‑ keyBasedOnCountFnWithOneElement
org.apache.beam.runners.direct.WriteWithShardingFactoryTest ‑ keyBasedOnCountFnWithTwoElements
org.apache.beam.sdk.PipelineTest ‑ testIdentityTransform
…
```
```
apache_beam.coders.coders_property_based_test.ProperyTestingCoders ‑ test_float_coder
apache_beam.coders.coders_property_based_test.ProperyTestingCoders ‑ test_row_coder
apache_beam.coders.coders_property_based_test.ProperyTestingCoders ‑ test_string_coder
apache_beam.coders.coders_property_based_test.TypesAreAllTested ‑ test_all_types_are_tested
apache_beam.coders.coders_test.AvroCoderTest ‑ test_avro_record_coder
apache_beam.coders.coders_test.CodersTest ‑ test_str_utf8_coder
apache_beam.coders.coders_test.DeterministicProtoCoderTest ‑ test_deterministic_proto_coder
apache_beam.coders.coders_test.DeterministicProtoCoderTest ‑ test_deterministic_proto_coder_determinism
apache_beam.coders.coders_test.FallbackCoderTest ‑ test_default_fallback_path
apache_beam.coders.coders_test.NullableCoderTest ‑ test_determinism
…
```
This pull request removes 2 skipped tests and adds 882 skipped tests. Note that renamed tests count towards both.
```
org.apache.beam.sdk.transforms.SplittableDoFnTest ‑ testLateData
org.apache.beam.sdk.values.PDoneTest ‑ testEmptyTransform
```
```
apache_beam.coders.fast_coders_test.FastCoders ‑ test_using_fast_impl
apache_beam.coders.row_coder_test.RowCoderTest ‑ test_overflows
apache_beam.coders.slow_coders_test.SlowCoders ‑ test_using_slow_impl
apache_beam.coders.typecoders_test.TypeCodersTest ‑ test_list_coder
apache_beam.dataframe.frames_test.AggregationTest ‑ test_agg_min_count
apache_beam.dataframe.frames_test.AggregationTest ‑ test_dataframe_agg_level
apache_beam.dataframe.frames_test.AggregationTest ‑ test_dataframe_agg_level_bool_only
apache_beam.dataframe.frames_test.AggregationTest ‑ test_dataframe_agg_level_numeric_only
apache_beam.dataframe.frames_test.AggregationTest ‑ test_dataframe_agg_multifunc_level
apache_beam.dataframe.frames_test.AggregationTest ‑ test_series_agg_level
…
```
[PR] Add a lower bound on pydantic [beam]
tvalentyn opened a new pull request, #30863: URL: https://github.com/apache/beam/pull/30863

fixes: #30852

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #` instead.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://github.com/apache/beam/blob/master/CONTRIBUTING.md#make-the-reviewers-job-easier).

To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)

GitHub Actions Tests Status (on master branch)
[![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
[![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
[![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
[![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)

See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI or the [workflows README](https://github.com/apache/beam/blob/master/.github/workflows/README.md) to see a list of phrases to trigger workflows.
Re: [I] [Failing Test]: beam_PreCommit_Python_Coverage suite fails with ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' [beam]
tvalentyn commented on issue #30852: URL: https://github.com/apache/beam/issues/30852#issuecomment-2038404492

> the fact that old version of pydantic gets installed due to --pre flag may or may not be a factor.

that _is_ the factor. The pydantic-2.0a4 distribution has the following:
```
(py38b) :py38b$ cat lib/python3.8/site-packages/pydantic-2.0a4.dist-info/entry_points.txt
[hypothesis]
_ = pydantic._hypothesis_plugin
```
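The `--pre` mechanics behind this are worth spelling out: under PEP 440, a pre-release such as 2.0a4 orders strictly before the final 2.0, and pip only considers pre-releases when `--pre` is given. The toy version key below (an illustration handling just a small subset of PEP 440, not pip's actual resolver) shows how an old alpha can satisfy constraints that a final release cannot:

```python
# Toy PEP 440 ordering for versions of the form X.Y[{a|b|rc}N].
# Pre-releases sort strictly before their final release; pip only
# considers them when --pre is passed.
import re

def version_key(version):
    m = re.fullmatch(r"(\d+)\.(\d+)(?:(a|b|rc)(\d+))?", version)
    major, minor, pre_kind, pre_num = m.groups()
    # Final releases (no pre-release segment) rank above any pre-release.
    pre_rank = {"a": 0, "b": 1, "rc": 2}.get(pre_kind, 3)
    return (int(major), int(minor), pre_rank, int(pre_num or 0))

available = ["1.10", "2.0a4", "2.0"]

# Default resolution: pre-releases are filtered out before picking the max.
finals = [v for v in available if version_key(v)[2] == 3]
assert max(finals, key=version_key) == "2.0"

# With --pre, 2.0a4 becomes a candidate and sorts below the final 2.0.
assert sorted(available, key=version_key) == ["1.10", "2.0a4", "2.0"]

# In particular, a constraint like pydantic<2 still admits 2.0a4 under
# --pre, because 2.0a4 orders strictly before 2.0 — one way a stale
# alpha with the old entry point can end up installed.
lt2 = [v for v in available if version_key(v) < version_key("2.0")]
assert max(lt2, key=version_key) == "2.0a4"
```

This is why a lower bound on pydantic (as in the linked PR) keeps such stale pre-releases from satisfying the requirement.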
[PR] [YAML] Interpret PDone as no outputs. [beam]
robertwb opened a new pull request, #30862: URL: https://github.com/apache/beam/pull/30862

This should fix https://github.com/apache/beam/issues/30446

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #` instead.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://github.com/apache/beam/blob/master/CONTRIBUTING.md#make-the-reviewers-job-easier).

To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)

GitHub Actions Tests Status (on master branch)
[![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
[![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
[![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
[![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)

See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI or the [workflows README](https://github.com/apache/beam/blob/master/.github/workflows/README.md) to see a list of phrases to trigger workflows.
Re: [PR] Proposed edits for Beam YAML overview [beam]
robertwb commented on code in PR #30842: URL: https://github.com/apache/beam/pull/30842#discussion_r1552524970 ## website/www/site/content/en/documentation/sdks/yaml.md: ## @@ -23,80 +23,132 @@ title: "Apache Beam YAML API" # Beam YAML API -While Beam provides powerful APIs for authoring sophisticated data -processing pipelines, it often still has too high a barrier for -getting started and authoring simple pipelines. Even setting up the -environment, installing the dependencies, and setting up the project -can be an overwhelming amount of boilerplate for some (though -https://beam.apache.org/blog/beam-starter-projects/ has gone a long -way in making this easier). - -Here we provide a simple declarative syntax for describing pipelines -that does not require coding experience or learning how to use an -SDK; any text editor will do. -Some installation may be required to actually *execute* a pipeline, but -we envision various services (such as Dataflow) to accept yaml pipelines -directly obviating the need for even that in the future. -We also anticipate the ability to generate code directly from these -higher-level yaml descriptions, should one want to graduate to a full -Beam SDK (and possibly the other direction as well as far as possible). - -Though we intend this syntax to be easily authored (and read) directly by -humans, this may also prove a useful intermediate representation for -tools to use as well, either as output (e.g. a pipeline authoring GUI) -or consumption (e.g. a lineage analysis tool) and expect it to be more -easily manipulated and semantically meaningful than the Beam protos -themselves (which concern themselves more with execution). - -It should be noted that everything here is still under development, but any -features already included are considered stable. Feedback is welcome at -d...@apache.beam.org. - -## Running pipelines - -The Beam yaml parser is currently included as part of the Apache Beam Python SDK. -This can be installed (e.g.
within a virtual environment) as +Beam YAML is a declarative syntax for describing Apache Beam pipelines by using +YAML files. You can use Beam YAML to author and run a Beam pipeline without +writing any code. + +## Overview + +Beam provides a powerful model for creating sophisticated data processing +pipelines. However, getting started with Beam programming can be challenging +because it requires writing code in one of the supported Beam SDK languages. +You need to understand the APIs, set up a project, manage dependencies, and +perform other programming tasks. + +Beam YAML makes it easier to get started with creating Beam pipelines. Instead +of writing code, you create a YAML file using any text editor. Then you submit +the YAML file to be executed by a runner. + +The Beam YAML syntax is designed to be human-readable but also suitable as an +intermediate representation for tools. For example, a pipeline authoring GUI +could output YAML, or a lineage analysis tool could consume the YAML pipeline +specifications. + +Beam YAML is still under development, but any features already included are +considered stable. Feedback is welcome at d...@apache.beam.org. + +## Prerequisites + +The Beam YAML parser is currently included as part of the +[Apache Beam Python SDK](../python/). You don't need to write Python code to use +Beam YAML, but you need the SDK to run pipelines locally. + +We recommend creating a +[virtual environment](../../../get-started/quickstart/python/#create-and-activate-a-virtual-environment) +so that all packages are installed in an isolated and self-contained +environment. After you set up your Python environment, install the SDK as +follows: ``` pip install apache_beam[yaml,gcp] ``` -In addition, several of the provided transforms (such as SQL) are implemented -in Java and their expansion will require a working Java interpreter.
(The -requisite artifacts will be automatically downloaded from the apache maven -repositories, so no further installs will be required.) -Docker is also currently required for local execution of these -cross-language-requiring transforms, but not for submission to a non-local -runner such as Flink or Dataflow. +In addition, several of the provided transforms, such as the SQL transform, are +implemented in Java and require a working Java interpreter. When you run a +pipeline with these transforms, the required artifacts are automatically +downloaded from the Apache Maven repositories. To execute these cross-language +transforms locally, you must have Docker installed on your local machine. Review Comment: The part about Docker is no longer true since https://github.com/apache/beam/pull/29283 . (Are there other places we should be updating this as well?)
Re: [PR] Create YAML Join Transform [beam]
itodotimothy6 commented on code in PR #30734: URL: https://github.com/apache/beam/pull/30734#discussion_r1552520613 ## sdks/python/apache_beam/yaml/yaml_join.py: ## @@ -0,0 +1,161 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""This module defines the Join operation.""" +import networkx as nx Review Comment: ignore above comment. I removed networkx in latest commit
Re: [PR] [do not merge] remove --pre for testing purposes [beam]
tvalentyn commented on PR #30859: URL: https://github.com/apache/beam/pull/30859#issuecomment-2038220335 Run Python_Coverage PreCommit
Re: [PR] Correct the version spec [beam]
github-actions[bot] commented on PR #30856: URL: https://github.com/apache/beam/pull/30856#issuecomment-2038219740 Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`: R: @riteshghorse for label python. Available commands: - `stop reviewer notifications` - opt out of the automated review tooling - `remind me after tests pass` - tag the comment author after tests pass - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers) The PR bot will only process comments in the main thread (not review comments).
Re: [PR] [do not merge] remove --pre for testing purposes [beam]
tvalentyn commented on PR #30859: URL: https://github.com/apache/beam/pull/30859#issuecomment-2038219066 Run Python_Coverage PreCommit
Re: [I] [Failing Test]: beam_PreCommit_Python_Coverage suite fails with ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' [beam]
tvalentyn commented on issue #30852: URL: https://github.com/apache/beam/issues/30852#issuecomment-2038213168 The hypothesis plugin might be provided by hypothesis itself, and we might need it for tests that use hypothesis. But it looks like Pydantic 2 doesn't work with hypothesis: https://github.com/pydantic/pydantic/discussions/5979 , and somehow pydantic (which I believe we didn't have in our dependency chain before) now interferes with pytest/hypothesis. The fact that old version of pydantic gets installed due to `--pre` flag may or may not be a factor.
[PR] [do not merge] remove --pre for testing purposes [beam]
tvalentyn opened a new pull request, #30859: URL: https://github.com/apache/beam/pull/30859 **Please** add a meaningful description for your change here
Re: [PR] Use empty flags for default expansion service options. [beam]
github-actions[bot] commented on PR #30858: URL: https://github.com/apache/beam/pull/30858#issuecomment-2038213399 Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Re: [PR] Use empty flags for default expansion service options. [beam]
robertwb commented on PR #30858: URL: https://github.com/apache/beam/pull/30858#issuecomment-2038210512 R: @Polber
[PR] Use empty flags for default expansion service options. [beam]
robertwb opened a new pull request, #30858: URL: https://github.com/apache/beam/pull/30858 The cli-invoking variant already parses and passes these in manually. Internal uses should not automatically pick up `sys.argv`.
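The rationale in this PR description, that internal entry points should not silently inherit `sys.argv`, can be illustrated with a generic argparse sketch. This is not Beam's actual code; the option name is made up for illustration.

```python
# Sketch (not Beam's code): argparse.parse_args(None) silently falls back to
# sys.argv[1:], so a library helper that defaults to None inherits whatever
# flags the host process was launched with. Defaulting internal callers to an
# explicit empty list avoids that surprise.
import argparse


def build_options(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--environment_type", default="DOCKER")
    # Passing [] parses "no flags"; passing None would read sys.argv[1:].
    return parser.parse_args(argv)


# Internal use: explicit empty flags, immune to the host process's argv.
internal = build_options([])
# CLI-style use: flags parsed and passed in explicitly by the caller.
cli = build_options(["--environment_type", "LOOPBACK"])
```

The design point is that only the CLI wrapper should ever touch `sys.argv`; every other caller passes flags (or an empty list) explicitly.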
Re: [PR] Initial Iceberg Sink [beam]
chamikaramj commented on code in PR #30797: URL: https://github.com/apache/beam/pull/30797#discussion_r1552407817 ## sdks/java/io/iceberg/src/main/java/org/apache/beam/io/iceberg/AppendFilesToTables.java: ## @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.io.iceberg; + +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.coders.SerializableCoder; +import org.apache.beam.sdk.coders.StringUtf8Coder; +import org.apache.beam.sdk.transforms.DoFn; +import org.apache.beam.sdk.transforms.GroupByKey; +import org.apache.beam.sdk.transforms.PTransform; +import org.apache.beam.sdk.transforms.ParDo; +import org.apache.beam.sdk.transforms.SerializableFunction; +import org.apache.beam.sdk.transforms.WithKeys; +import org.apache.beam.sdk.transforms.windowing.BoundedWindow; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.iceberg.AppendFiles; +import org.apache.iceberg.Snapshot; +import org.apache.iceberg.Table; +import org.apache.iceberg.catalog.Catalog; +import org.apache.iceberg.catalog.TableIdentifier; +import org.checkerframework.checker.nullness.qual.MonotonicNonNull; + +class AppendFilesToTables +extends PTransform, PCollection>> { + + private final IcebergCatalogConfig catalogConfig; + + AppendFilesToTables(IcebergCatalogConfig catalogConfig) { +this.catalogConfig = catalogConfig; + } + + @Override + public PCollection> expand(PCollection writtenFiles) { + +// Apply any sharded writes and flatten everything for catalog updates +return writtenFiles +.apply( +"Key metadata updates by table", +WithKeys.of( +new SerializableFunction() { + @Override + public String apply(FileWriteResult input) { +return input.getTableIdentifier().toString(); + } +})) +// .setCoder(KvCoder.of(StringUtf8Coder.of(), new MetadataUpdate.MetadataUpdateCoder())) Review Comment: Uncomment or delete. ## sdks/java/io/iceberg/src/main/java/org/apache/beam/io/iceberg/OneTableDynamicDestinations.java: ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.io.iceberg; + +import org.apache.beam.sdk.schemas.Schema; +import org.apache.beam.sdk.values.Row; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.catalog.TableIdentifier; +import org.checkerframework.checker.nullness.qual.MonotonicNonNull; + +class OneTableDynamicDestinations implements DynamicDestinations { + + private static final Schema EMPTY_SCHEMA = Schema.builder().build(); + private static final Row EMPTY_ROW = Row.nullRow(EMPTY_SCHEMA); + + // TableId represented as String for serializability + private final String tableIdString; + + private transient @MonotonicNonNull TableIdentifier tableId; + + private TableIdentifier getTableIdentifier() { +if (tableId == null) { + tableId = TableIdentifier.parse(tableIdString); +} +return tableId; + } + + OneTableDynamicDestinations(TableIdentifier tableId) { +this.tableIdString = tableId.toString(); + } + + @Override + public Schema getMetadataSchema() { +return EMPTY_SCHEMA; + } + + @Override + public Row assignDestinationMetadata(Row data) { +return EMPTY_ROW; + } + + @Override + public IcebergDestination
Re: [PR] Add Redistribute transform to model, Java SDK, and most active runners [beam]
kennknowles commented on code in PR #30545: URL: https://github.com/apache/beam/pull/30545#discussion_r1541202528 ## runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkBatchTransformTranslators.java: ## @@ -423,6 +431,76 @@ public void translateNode( } } + private static class RedistributeByKeyTranslatorBatch + implements FlinkBatchPipelineTranslator.BatchTransformTranslator< + Redistribute.RedistributeByKey> { + +@Override +public void translateNode( +Redistribute.RedistributeByKey transform, FlinkBatchTranslationContext context) { + final DataSet>> inputDataSet = + context.getInputDataSet(context.getInput(transform)); + // Construct an instance of CoderTypeInformation which contains the pipeline options. + // This will be used to initialized FileSystems. + final CoderTypeInformation>> outputType = + ((CoderTypeInformation>>) inputDataSet.getType()) + .withPipelineOptions(context.getPipelineOptions()); + // We insert a NOOP here to initialize the FileSystems via the above CoderTypeInformation. Review Comment: I couldn't find anything. Guess I'll dive into the history of the line this is copied from and see if there was something clear. Might be obsolete too. ## runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineTranslator.java: ## @@ -917,6 +919,41 @@ private void groupByKeyAndSortValuesHelper( } }); +registerTransformTranslator( +RedistributeByKey.class, Review Comment: I don't understand the question? Does this help? - `Redistribute.arbitrarily()` has the same composite structure as `Reshuffle.viaRandomKey()` - `Redistribute.byKey()` has the same composite structure as `Reshuffle.of()` - This code here is the Dataflow v1 translation for `Redistribute.byKey()` which is simplified from the translation of `Reshuffle.of()` - The proposal for the future is to do even better for the `arbitrarily` case by having Dataflow expose a primitive rather than the existing cludge on top of GroupByKey. 
This thread didn't mention it but now I realize there's a potential problem because the purpose of the override was to save some data shuffled by not reifying the timestamps since they are available elsewhere in Dataflow-specific shuffle metadata. Now I'm on the fence, because I'd rather not rely on that always being the case, as it would be update-incompatible to change it, whereas reifying all metadata in a standard way is robust to changes. ## runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPortablePipelineTranslator.java: ## @@ -301,6 +305,24 @@ private void translateReshuffle( Iterables.getOnlyElement(transform.getOutputsMap().values()), inputDataStream.rebalance()); } + private void translateRedistributeByKey( + String id, RunnerApi.Pipeline pipeline, StreamingTranslationContext context) { +RunnerApi.PTransform transform = pipeline.getComponents().getTransformsOrThrow(id); +DataStream>> inputDataStream = + context.getDataStreamOrThrow(Iterables.getOnlyElement(transform.getInputsMap().values())); +context.addDataStream( +Iterables.getOnlyElement(transform.getOutputsMap().values()), inputDataStream.rebalance()); Review Comment: Yea, we could share the same implementation. I followed my rule that code should be shared if it is representing the same thing by logical necessity, otherwise not shared. In this case there are two very similar things that are temporarily having the same implementation. I don't care too much, could re-use the same lines of code for now until we choose to diverge. I have a slight preference for keeping them separate to make it obvious that there is no logical necessity that they be in sync. 
## runners/flink/flink_runner.gradle: ## @@ -309,6 +311,8 @@ def createValidatesRunnerTask(Map m) { // Flink reshuffle override does not preserve all metadata excludeTestsMatching 'org.apache.beam.sdk.transforms.ReshuffleTest.testReshufflePreservesMetadata' +// Flink redistribute override does not preserve all metadata Review Comment: Yea, I don't know what all is needed. I'll take another look and see if it is obvious. ## runners/samza/src/main/java/org/apache/beam/runners/samza/translation/ReshuffleTranslator.java: ## @@ -42,6 +42,16 @@ public class ReshuffleTranslator implements TransformTranslator>, PCollection>>> { + private final String prefix; + + ReshuffleTranslator(String prefix) { +this.prefix = prefix; + } + + ReshuffleTranslator() { +this("rhfl-"); Review Comment: It was not! Thank you ## sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Redistribute.java: ## @@ -0,0 +1,300 @@ +/* + *
Re: [I] [Failing Test]: beam_PreCommit_Python_Coverage suite fails with ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' [beam]
tvalentyn commented on issue #30852: URL: https://github.com/apache/beam/issues/30852#issuecomment-2038186835 https://stackoverflow.com/questions/71394400/how-to-block-the-hypothesis-pytest-plugin has some discussion of how to disable it
Re: [I] [Failing Test]: beam_PreCommit_Python_Coverage suite fails with ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' [beam]
tvalentyn commented on issue #30852: URL: https://github.com/apache/beam/issues/30852#issuecomment-2038183904 Even though pip selects a bizarre version for pydantic (pydantic 2), this pydantic-hypothesis plugin seems broken: `from pydantic import _hypothesis_plugin` fails, and the more correct import `from pydantic.v1 import _hypothesis_plugin` also fails with recent versions of hypothesis
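Import checks like the two above can be probed without crashing the interpreter by using `importlib.util.find_spec`. A small standard-library sketch (the module names are taken from the comment; the helper name is made up):

```python
# Sketch: probe whether a module is importable without raising at import
# time. find_spec returns None when the module cannot be located.
import importlib.util


def importable(module_name):
    """True if `module_name` can be found by the current interpreter."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a *parent* package (e.g. "pydantic") is itself missing.
        return False


# In the broken environment described above, both of these probes failed:
for mod in ("pydantic._hypothesis_plugin", "pydantic.v1._hypothesis_plugin"):
    importable(mod)
```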
Re: [PR] Implement LengthPrefixCoder.to_type_hint [beam]
damccorm commented on PR #30559: URL: https://github.com/apache/beam/pull/30559#issuecomment-2038162485 SGTM
Re: [PR] Implement LengthPrefixCoder.to_type_hint [beam]
damccorm closed pull request #30559: Implement LengthPrefixCoder.to_type_hint URL: https://github.com/apache/beam/pull/30559
Re: [PR] Implement LengthPrefixCoder.to_type_hint [beam]
github-actions[bot] commented on PR #30857: URL: https://github.com/apache/beam/pull/30857#issuecomment-2038161468 Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Re: [PR] Implement LengthPrefixCoder.to_type_hint [beam]
hjtran commented on PR #30559: URL: https://github.com/apache/beam/pull/30559#issuecomment-2038160409 I've just created a duplicate PR without the rebase issues https://github.com/apache/beam/pull/30857
Re: [PR] byte-buddy:1.14.12 [beam]
damccorm merged PR #30746: URL: https://github.com/apache/beam/pull/30746
Re: [PR] Implement LengthPrefixCoder.to_type_hint [beam]
hjtran commented on PR #30857: URL: https://github.com/apache/beam/pull/30857#issuecomment-2038159581 R: @damccorm
[PR] Implement LengthPrefixCoder.to_type_hint [beam]
hjtran opened a new pull request, #30857: URL: https://github.com/apache/beam/pull/30857 (Duplicate of another [PR](https://github.com/apache/beam/pull/30559) but without rebase issues) LengthPrefixCoder doesn't define to_type_hint even though it seems pretty well defined. I've implemented it by just inferring it from its value coder. Motivating use case - we have some debug tools that infer the types of pcollections based on to_type_hint. The LPCoder is used in a lot of places. Bonus/unrelated change - I've pulled out the GitHub issue referenced in the NotImplementedError raised by to_type_hint. Every time I run into this error, I check the GitHub issue and get more confused. It doesn't seem useful to refer to it in the error message, but it does still seem relevant, so I pulled it out into a comment. Testing Done I haven't actually run the new unit test I wrote since I'm struggling to get my Python SDK dev environment up (I've posted on the dev@ mailing list). Hoping the GH Action can just do it for me :)
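The delegation approach this PR describes, a wrapper coder answering `to_type_hint` by asking the coder it wraps, can be sketched with toy stand-ins. These are illustrative classes, not Beam's actual `Coder` hierarchy:

```python
# Toy stand-ins (not Beam's actual classes) for the delegation idea:
# length-prefixing changes the wire format, not the value type, so the
# wrapped coder's type hint is the right answer for the wrapper too.
class BytesCoder:
    def to_type_hint(self):
        return bytes


class LengthPrefixedCoder:
    """Wraps another coder and frames each encoded value with its length."""

    def __init__(self, value_coder):
        self._value_coder = value_coder

    def to_type_hint(self):
        # Delegate: the wrapper carries the same value type as its payload.
        return self._value_coder.to_type_hint()


assert LengthPrefixedCoder(BytesCoder()).to_type_hint() is bytes
```

Note the delegation composes: wrapping a wrapper still reports the innermost value type.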
Re: [PR] Implement LengthPrefixCoder.to_type_hint [beam]
damccorm commented on PR #30559: URL: https://github.com/apache/beam/pull/30559#issuecomment-2038146227 Uh oh, looks like a bad rebase
Re: [I] [Failing Test]: beam_PreCommit_Python_Coverage suite fails with ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' [beam]
tvalentyn commented on issue #30852: URL: https://github.com/apache/beam/issues/30852#issuecomment-2038144463 > - pydantic [required: <3, installed: 2.0a4] this seems to be incorrect installer behavior; 2.0a4 shouldn't be installed under these constraints. Actually, I misread this: it still fits the range, but the chosen version is strange; there might be more constraints.
Re: [I] [Failing Test]: beam_PreCommit_Python_Coverage suite fails with ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' [beam]
tvalentyn commented on issue #30852: URL: https://github.com/apache/beam/issues/30852#issuecomment-2038142141 from pipdeptree: ``` * google-cloud-aiplatform==1.46.0 - proto-plus [required: >=1.22.0,<2.0.0dev, installed: 1.24.0.dev0] - pydantic [required: <3, installed: 2.0a4] ``` this seems to be incorrect installer behavior; 2.0a4 shouldn't be installed under these constraints. Not sure how that happens. Likely what triggered the error for us was a recent release in https://pypi.org/project/google-cloud-aiplatform/#history , which added the pydantic dependency.
Re: [I] [Bug]: Beam Sql is ignoring aliases fields in some situations which causes to huge data loss [beam]
kennknowles commented on issue #30498: URL: https://github.com/apache/beam/issues/30498#issuecomment-2038130587 First step is I want to figure out if it is a Beam bug or a Calcite bug. I expect it to be a Beam bug. A likely source of the problem would be something like https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamAggregateProjectMergeRule.java This is a rule that merges aggregation and projects, which seems to be what happens here. We have our own version of the rule because Beam has special projection pushdown into IOs. Just as an example of how I am thinking about debugging this. I will now read that file and report back if I see anything obvious.
Re: [PR] Implement LengthPrefixCoder.to_type_hint [beam]
hjtran commented on PR #30559: URL: https://github.com/apache/beam/pull/30559#issuecomment-2038124884 > I'll merge later after checks pass Thanks for all the reviews! I really appreciate it!
Re: [I] [Failing Test]: beam_PreCommit_Python_Coverage suite fails with ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' [beam]
tvalentyn commented on issue #30852: URL: https://github.com/apache/beam/issues/30852#issuecomment-2038121140 Beam test infra installs pre-release dependencies to detect possible issues ahead of releases. The command: ``` pip install --pre "tensorflow_transform>=1.13.0,<1.14.0" apache-beam[gcp,test] ``` installs pydantic==2.0a4. The command ``` pip install "tensorflow_transform>=1.13.0,<1.14.0" apache-beam[gcp,test] ``` installs pydantic==1.10.15. The tft requirement comes from: https://github.com/apache/beam/blob/21129a41e031c150c3f610639d71a95a3a941243/sdks/python/tox.ini#L316
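The interaction between `--pre` and the `pydantic<3` constraint can be reproduced with the `packaging` library (pip's resolver applies the same pre-release rules; assuming `packaging` is installed):

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# The constraint google-cloud-aiplatform places on pydantic.
spec = SpecifierSet("<3")

# By default, pre-releases are excluded even when they fit the range.
print(spec.contains(Version("2.0a4")))                    # False
# With pre-releases allowed (what `pip install --pre` does), 2.0a4 qualifies.
print(spec.contains(Version("2.0a4"), prereleases=True))  # True
# The stable release always satisfies the constraint.
print(spec.contains(Version("1.10.15")))                  # True
```

This is why the same requirement set resolves to pydantic 2.0a4 only when `--pre` is passed.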
Re: [PR] Implement LengthPrefixCoder.to_type_hint [beam]
damccorm commented on PR #30559: URL: https://github.com/apache/beam/pull/30559#issuecomment-2038117336 I'll merge later after checks pass
Re: [PR] [Testing] Try fixing inference benchmark tests [beam]
riteshghorse commented on PR #30855: URL: https://github.com/apache/beam/pull/30855#issuecomment-2038101504 Run Inference Benchmarks
[PR] Correct the version spec [beam]
tvalentyn opened a new pull request, #30856: URL: https://github.com/apache/beam/pull/30856 It is incorrect to use `<=` in this case, see: https://cwiki.apache.org/confluence/display/BEAM/Dependency+management+guidelines+for+Beam+Python+SDK+maintainers
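The problem with `<=` upper bounds can be illustrated with the `packaging` library (an illustration of the guideline, with made-up version numbers; the wiki page above is authoritative): a `<=` cap pins to an exact known release and silently excludes future patch releases, while an exclusive bound on the next major version admits them.

```python
from packaging.specifiers import SpecifierSet

# A `<=` cap pins to an exact known release; a future patch is rejected.
capped = SpecifierSet(">=2.0,<=2.0.0")
print("2.0.1" in capped)      # False: the patch release is excluded

# An exclusive bound on the next major version admits patches.
open_bound = SpecifierSet(">=2.0,<3")
print("2.0.1" in open_bound)  # True
print("3.0" in open_bound)    # False
```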
Re: [PR] Implement LengthPrefixCoder.to_type_hint [beam]
hjtran commented on code in PR #30559: URL: https://github.com/apache/beam/pull/30559#discussion_r1552348156 ## local-env-setup.sh: ## @@ -24,7 +24,7 @@ darwin_install_pip3_packages() { install_go_packages(){ echo "Installing goavro" -go get github.com/linkedin/goavro/v2 +go install github.com/linkedin/goavro@latest Review Comment: No, this was accidental. Had to make this change to get my dev container to work. [Mailing list thread](https://lists.apache.org/thread/4lp8nqppd2xv8m694oyjzzjqvwtor4gl) Will revert
Re: [PR] [Testing] Try fixing inference benchmark tests [beam]
github-actions[bot] commented on PR #30855: URL: https://github.com/apache/beam/pull/30855#issuecomment-2038099474 Stopping reviewer notifications for this pull request: requested by reviewer
Re: [PR] [Testing] Try fixing inference benchmark tests [beam]
riteshghorse commented on PR #30855: URL: https://github.com/apache/beam/pull/30855#issuecomment-2038097863 Run Inference Benchmarks
Re: [PR] [Testing] Try fixing inference benchmark tests [beam]
riteshghorse commented on PR #30855: URL: https://github.com/apache/beam/pull/30855#issuecomment-2038096484 stop reviewer notifications
Re: [PR] Implement LengthPrefixCoder.to_type_hint [beam]
damccorm commented on code in PR #30559: URL: https://github.com/apache/beam/pull/30559#discussion_r1552338484 ## local-env-setup.sh: ## @@ -24,7 +24,7 @@ darwin_install_pip3_packages() { install_go_packages(){ echo "Installing goavro" -go get github.com/linkedin/goavro/v2 +go install github.com/linkedin/goavro@latest Review Comment: Did you mean to make this change? I think we need `get` here, not just `install`
Re: [PR] [Testing] Try fixing inference benchmark tests [beam]
riteshghorse commented on PR #30855: URL: https://github.com/apache/beam/pull/30855#issuecomment-2038084704 Run Inference Benchmarks
[PR] [Testing] Try fixing inference benchmark tests [beam]
riteshghorse opened a new pull request, #30855: URL: https://github.com/apache/beam/pull/30855 **Please** add a meaningful description for your change here Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #` instead. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://github.com/apache/beam/blob/master/CONTRIBUTING.md#make-the-reviewers-job-easier). To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md) GitHub Actions Tests Status (on master branch) [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule) [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule) [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule) [![Go
tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule) See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI or the [workflows README](https://github.com/apache/beam/blob/master/.github/workflows/README.md) to see a list of phrases to trigger workflows.
Re: [PR] [Testing] Try fixing inference benchmark tests [beam]
riteshghorse commented on PR #30855: URL: https://github.com/apache/beam/pull/30855#issuecomment-2038055456 Run Inference Benchmarks
Re: [PR] Fix Jms drop record [beam]
Abacn commented on PR #30218: URL: https://github.com/apache/beam/pull/30218#issuecomment-2038050053 #30853 appears to be working. Basically it sets a timeout after which the checkpoint is forced to be finalized. This is unsafe in general, but it did release the records stuck in the session's internal buffer
Re: [I] [Feature Request]: Support for Apache Flink 1.18 in Beam Runners [beam]
thebozzcl commented on issue #30789: URL: https://github.com/apache/beam/issues/30789#issuecomment-2038042064 Also, please note that the codebase in #30197 has diverged from the main branch quite a bit, so I'm not even sure my changes will apply once that's rebased.
Re: [PR] [Java] ManagedIO [beam]
chamikaramj commented on code in PR #30808: URL: https://github.com/apache/beam/pull/30808#discussion_r1552304475 ## sdks/java/managed/src/main/java/org/apache/beam/sdk/managed/Managed.java: ## @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.managed; + +import com.google.auto.value.AutoValue; +import java.util.Map; +import java.util.regex.Pattern; +import javax.annotation.Nullable; +import org.apache.beam.sdk.schemas.transforms.SchemaTransform; +import org.apache.beam.sdk.values.PCollectionRowTuple; + +public class Managed { + public static Read read() { Review Comment: `p.apply(Managed.read(ManagedIO.ICEBERG).withConfig(config))`
Re: [I] [Feature Request]: Support for Apache Flink 1.18 in Beam Runners [beam]
thebozzcl commented on issue #30789: URL: https://github.com/apache/beam/issues/30789#issuecomment-2038040661 I forked from that PR to add support for 1.18. It's a trivial change, too: https://github.com/thebozzcl/beam/commit/432643bfd70f1b8a6052c33a04774b93da5a7bae So far this is working fine in our tests, with a caveat: in its current state, there are some incompatibilities with the publicly released libraries (mostly around `SerializablePipelineOptions`). For our tests, I had to use custom-built versions of a few libraries based on the repo: * beam-runners-core-java * beam-runners-java-fn-execution * beam-runners-java-job-service * beam-sdks-java-core I'm hopeful that once the PR for 1.17 is merged, adding support for 1.18 will be trivial. We're going to try using these libraries in one of our more complex pipelines, see if everything still works.
[I] [Failing Test]: Running `which java` seems to cause flakes in some YAML tests on GHA runners [beam]
tvalentyn opened a new issue, #30854: URL: https://github.com/apache/beam/issues/30854 ### What happened? In the stack trace below, the process seems to get stuck when running subprocess.run(['which', 'java']). Filing to track this issue in case it is common. cc: @Polber ``` AggregationTest.test_combine_mean_minimal_yaml [gw1] linux -- Python 3.9.19 /runner/_work/beam/beam/sdks/python/test-suites/tox/py39/build/srcs/sdks/python/target/.tox-py39/py39/bin/python self = @mock.patch('apache_beam.Pipeline', TestPipeline) def test_yaml_example(self): with open(pipeline_spec_file, encoding="utf-8") as f: lines = f.readlines() expected_key = '# Expected:\n' if expected_key in lines: expected = lines[lines.index('# Expected:\n') + 1:] else: raise ValueError( f"Missing '# Expected:' tag in example file '{pipeline_spec_file}'") for i, line in enumerate(expected): expected[i] = line.replace('# ', '').replace('\n', '') pipeline_spec = yaml.load( ''.join(lines), Loader=yaml_transform.SafeLineLoader) with TestEnvironment() as env: if custom_preprocessor: pipeline_spec = custom_preprocessor(pipeline_spec, expected, env) with beam.Pipeline(options=PipelineOptions( pickle_library='cloudpickle', **yaml_transform.SafeLineLoader.strip_metadata(pipeline_spec.get( 'options', {})))) as p: > actual = yaml_transform.expand_pipeline(p, pipeline_spec) apache_beam/yaml/examples/examples_test.py:77: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:1035: in expand_pipeline return YamlTransform( target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:1006: in expand result = expand_transform( target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:455: in expand_transform return expand_composite_transform(spec, scope) target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:529: in expand_composite_transform return
CompositePTransform.expand(None) target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:520: in expand inner_scope.compute_all() target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:247: in compute_all self.compute_outputs(transform_id) target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:96: in wrapper self._cache[key] = func(self, *args) target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:283: in compute_outputs return expand_transform(self._transforms_by_uuid[transform_id], self) target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:457: in expand_transform return expand_leaf_transform(spec, scope) target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:479: in expand_leaf_transform ptransform = scope.create_ptransform(spec, inputs_dict.values()) target/.tox-py39/py39/lib/python3.9/site-packages/apache_beam/yaml/yaml_transform.py:373: in create_ptransform provider = self.best_provider(spec, input_providers) else: env_list = None # Use execv instead of execve. executable = os.fsencode(executable) if os.path.dirname(executable): executable_list = (executable,) else: # This matches the behavior of os._execvpe(). executable_list = tuple( os.path.join(os.fsencode(dir), executable) for dir in os.get_exec_path(env)) fds_to_keep = set(pass_fds) fds_to_keep.add(errpipe_write) self.pid = _posixsubprocess.fork_exec( args, executable_list, close_fds, tuple(sorted(map(int, fds_to_keep))), cwd, env_list, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, errpipe_read, errpipe_write, restore_signals, start_new_session, gid, gids, uid, umask, preexec_fn) self._child_created = True finally: # be sure the FD is closed no matter what os.close(errpipe_write) self._close_pipe_fds(p2cread,
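For reference, one fork-free way to probe for a binary (an illustration, not the fix adopted for this issue) is `shutil.which`, which does the PATH lookup in-process; where a subprocess is unavoidable, a `timeout` at least turns an indefinite hang into a diagnosable error:

```python
import shutil
import subprocess

# shutil.which performs the same PATH search as `which java`, but without
# forking a child process, which is where the test appears to hang.
java_path = shutil.which("java")
print("java on PATH:", java_path)

# When a subprocess really is needed, a timeout bounds the hang.
try:
    subprocess.run(["which", "java"], capture_output=True, timeout=10)
except (subprocess.TimeoutExpired, FileNotFoundError):
    pass  # treat as "java not found" rather than hanging the suite
```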
Re: [I] [Failing Test]: beam_PreCommit_Python_Coverage suite fails with ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' [beam]
tvalentyn commented on issue #30852: URL: https://github.com/apache/beam/issues/30852#issuecomment-2037998636 Not sure how pydantic comes into the picture here yet, but for 'pydantic._hypothesis_plugin' to be available, it seems that we need to have 1.0.0
Re: [PR] Implementing lull reporting at bundle level processing [beam]
dustin12 commented on code in PR #30693: URL: https://github.com/apache/beam/pull/30693#discussion_r1552236808 ## runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/ExecutionStateTracker.java: ## @@ -335,6 +352,19 @@ protected void takeSampleOnce(long millisSinceLastSample) { transitionsAtLastSample = transitionsAtThisSample; } updateMillisSinceLastTransition(millisSinceLastSample, state); +updateMillisSinceBundleStart(millisSinceLastSample); + } + + // Override this to implement bundle level lull reporting. + protected void reportBundleLull(Thread trackedThread, long millisSinceBundleStart) {} + + @SuppressWarnings("NonAtomicVolatileUpdate") + private void updateMillisSinceBundleStart(long millisSinceLastSample) { +millisSinceBundleStart += millisSinceLastSample; +if (millisSinceBundleStart > nextBundleLullReportMs) { + reportBundleLull(trackedThread, millisSinceBundleStart); + nextBundleLullReportMs += BUNDLE_LULL_REPORT_MS; +} } Review Comment: maybe a BEAM/Java developer can help here. Is there a reason we suppress these warnings? Is it safe to do for some reason? It seems to indicate a possible data race which we should fix? ## runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StackTraceUtil.java: ## @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License.
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.runners.dataflow.worker; + +import org.apache.beam.runners.core.SimpleDoFnRunner; +import org.apache.beam.sdk.annotations.Internal; +import org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.collect.ImmutableSet; + +/** Utility methods to print the stack traces of all the threads. */ +@Internal +public final class StackTraceUtil { + private static final ImmutableSet FRAMEWORK_CLASSES = + ImmutableSet.of(SimpleDoFnRunner.class.getName(), DoFnInstanceManagers.class.getName()); + + public static String getStackTraceForLullMessage(StackTraceElement[] stackTrace) { +StringBuilder message = new StringBuilder(); +for (StackTraceElement e : stackTrace) { + if (FRAMEWORK_CLASSES.contains(e.getClassName())) { +break; Review Comment: where is this code from? It doesn't seem to have moved from another file and I'm not sure why it's being added. I'm also a bit confused as to why we would stop a stack trace as soon as we got to a framework class; it doesn't seem related to bundle lulls. ## runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/ExecutionStateTracker.java: ## @@ -139,8 +140,17 @@ public String getDescription() { */ private volatile long millisSinceLastTransition = 0; + /** + * The number of milliseconds since the {@link ExecutionStateTracker} initial state. + * + * This variable is updated by the Sampling thread, and read by the Progress Reporting thread, + * thus it being marked volatile. + */ + private volatile long millisSinceBundleStart = 0; Review Comment: why is this marked as volatile?
And thus needs the suppress warning on line 361. ## runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/DataflowExecutionContext.java: ## @@ -310,12 +350,64 @@ public Closeable activate() { } } +private String getBundleLullMessage(Thread trackedThread, Duration lullDuration) { + StringBuilder message = new StringBuilder(); + message + .append("Operation ongoing in bundle for at least ") + .append(DURATION_FORMATTER.print(lullDuration.toPeriod())) + .append(" without completing") + .append("\n"); + synchronized (this) { +if (this.activeMessageMetadata != null) { + message.append( + "Current user step name: " + getActiveMessageMetadata().get().userStepName() + "\n"); + message.append( + "Time spent in this step(millis): " + + (clock.currentTimeMillis() - getActiveMessageMetadata().get().startTime()) + + "\n"); +} +message.append("Processing times in each
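On the `NonAtomicVolatileUpdate` question in the review above: `volatile` guarantees visibility, but `millisSinceBundleStart += x` is still a read-modify-write that two threads can interleave, losing updates. A hedged sketch of the usual fix (an illustrative standalone class, not the PR's actual code) replaces the field with `AtomicLong`:

```java
import java.util.concurrent.atomic.AtomicLong;

public class LullTracker {
    // AtomicLong makes each accumulation a single atomic operation, so no
    // @SuppressWarnings("NonAtomicVolatileUpdate") is needed and no samples
    // can be lost to interleaved read-modify-write cycles.
    private final AtomicLong millisSinceBundleStart = new AtomicLong(0);
    private final AtomicLong nextReportAt;
    private final long reportIntervalMs;

    public LullTracker(long reportIntervalMs) {
        this.reportIntervalMs = reportIntervalMs;
        this.nextReportAt = new AtomicLong(reportIntervalMs);
    }

    /** Adds a sample; returns true exactly once per crossed report threshold. */
    public boolean addSample(long millisSinceLastSample) {
        long total = millisSinceBundleStart.addAndGet(millisSinceLastSample);
        long threshold = nextReportAt.get();
        // compareAndSet ensures only one thread reports per threshold crossing.
        return total > threshold
            && nextReportAt.compareAndSet(threshold, threshold + reportIntervalMs);
    }

    public long totalMillis() {
        return millisSinceBundleStart.get();
    }

    public static void main(String[] args) {
        LullTracker tracker = new LullTracker(1000);
        System.out.println(tracker.addSample(600));  // false: 600 <= 1000
        System.out.println(tracker.addSample(600));  // true: 1200 > 1000
        System.out.println(tracker.addSample(100));  // false: next threshold is 2000
        System.out.println(tracker.totalMillis());   // 1300
    }
}
```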
[PR] [Test only] Introduce checkpointtimeout [beam]
Abacn opened a new pull request, #30853: URL: https://github.com/apache/beam/pull/30853 Test only, attempt to fix https://github.com/apache/beam/pull/30218#issuecomment-2037429676 However, I found that if the checkpoint can expire, the unacked messages get redelivered, but the new checkpoint won't finalize either, causing repeated redelivery and duplicate messages. I now think the underlying issue is that checkpoint finalization is not guaranteed to happen in a timely manner, and that is working as expected. Expiring the outstanding checkpoint won't resolve the stuck messages when throughput is low, given the asynchronous ack mechanism. Thus, another option is to expose the "Auto acknowledge" client option in JmsIO to retain the previous behavior; however, data loss is possible, as in the previous implementation **Please** add a meaningful description for your change here Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #` instead. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://github.com/apache/beam/blob/master/CONTRIBUTING.md#make-the-reviewers-job-easier).
To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md) GitHub Actions Tests Status (on master branch) [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule) [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule) [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule) [![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule) See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI or the [workflows README](https://github.com/apache/beam/blob/master/.github/workflows/README.md) to see a list of phrases to trigger workflows.
Re: [PR] [yaml] disable javascript mapping for python >=3.12 [beam]
Polber commented on code in PR #30843: URL: https://github.com/apache/beam/pull/30843#discussion_r1552245985 ## sdks/python/setup.py: ## @@ -368,7 +368,7 @@ def get_portability_package_data(): 'grpcio>=1.33.1,!=1.48.0,<2', 'hdfs>=2.1.0,<3.0.0', 'httplib2>=0.8,<0.23.0', - 'js2py>=0.74,<1', + 'js2py>=0.74,<1; python_version<"3.12"', Review Comment: Oh sorry I totally misunderstood what you were asking, thanks for adding that!
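The `python_version<"3.12"` environment marker added in the diff above can be checked with the `packaging` library (assuming it is installed; pip evaluates markers with the same PEP 508 rules):

```python
from packaging.markers import Marker

marker = Marker('python_version < "3.12"')

# pip evaluates markers against the installing interpreter; passing an
# explicit environment here shows both outcomes.
print(marker.evaluate({"python_version": "3.11"}))  # True: js2py is installed
print(marker.evaluate({"python_version": "3.12"}))  # False: js2py is skipped
```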
Re: [PR] ignore -- end of options [beam]
Polber closed pull request #30845: ignore -- end of options URL: https://github.com/apache/beam/pull/30845
Re: [PR] [Java] ManagedIO [beam]
robertwb commented on code in PR #30808: URL: https://github.com/apache/beam/pull/30808#discussion_r1552206457 ## sdks/java/managed/src/main/java/org/apache/beam/sdk/managed/Managed.java: ## @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.managed; + +import com.google.auto.value.AutoValue; +import java.util.Map; +import java.util.regex.Pattern; +import javax.annotation.Nullable; +import org.apache.beam.sdk.schemas.transforms.SchemaTransform; +import org.apache.beam.sdk.values.PCollectionRowTuple; + +public class Managed { + public static Read read() { Review Comment: Yes, I'm totally fine with a static list for the initial release. The key point is getting the API stable and generating the right protos.
Re: [PR] Update python transform catalog [beam]
damccorm commented on code in PR #30788: URL: https://github.com/apache/beam/pull/30788#discussion_r1552164898 ## sdks/python/apache_beam/examples/snippets/transforms/aggregation/batchelements_test.py: ## @@ -0,0 +1,49 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import unittest + +import mock +from apache_beam.testing.test_pipeline import TestPipeline +from apache_beam.testing.util import assert_that, equal_to + +from . import batchelements + + +def check_batches(actual): + expected = [['', '凌', ''], ['', '凌', ''], ['', '凌', '', '']] Review Comment: I'm a little bit skeptical of this test since the batches produced here are an implementation detail of how this is broken into bundles. Could we maybe instead check: 1) That this does batching at all (returns lists, not strings) 2) doesn't drop elements (if we pull them out of lists, and then count by element they return the correct counts) ## sdks/python/apache_beam/examples/snippets/transforms/aggregation/tolist_test.py: ## @@ -0,0 +1,47 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. 
See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import mock +import unittest + +from apache_beam.testing.test_pipeline import TestPipeline +from apache_beam.testing.util import assert_that, equal_to + +from . import tolist + + +def identity(x): + return x + + +@mock.patch('apache_beam.Pipeline', TestPipeline) +# pylint: disable=line-too-long +@mock.patch( +'apache_beam.examples.snippets.transforms.aggregation.tolist.print', +identity) +# pylint: enable=line-too-long +class BatchElementsTest(unittest.TestCase): + def test_tolist(self): +def check(result): + assert_that(result, equal_to([['', '凌', '', '']])) Review Comment: Similarly, ordering isn't guaranteed here; could we try to test that it is a list and it contains all the correct elements? ## website/www/site/content/en/documentation/transforms/python/overview.md: ## @@ -48,13 +48,13 @@ limitations under the License. TransformDescription - ApproximateQuantilesNot available. See https://issues.apache.org/jira/browse/BEAM-6694;>BEAM-6694 for updates. - ApproximateUniqueNot available. See https://issues.apache.org/jira/browse/BEAM-6693;>BEAM-6693 for updates. + ApproximateQuantilesGiven a distribution, find the approximate N-tiles. + ApproximateUniqueGiven a pcollection, return the estimated number of unique elements. 
+ BatchElementsGiven a pcollection, return the estimated number of unique elements. Review Comment: Thanks for catching these. Could we (a) also add ToList and (b) add these transforms to https://github.com/apache/beam/blob/21129a41e031c150c3f610639d71a95a3a941243/website/www/site/layouts/partials/section-menu/en/documentation.html#L324 ## sdks/python/apache_beam/examples/snippets/transforms/aggregation/batchelements_test.py: ## @@ -0,0 +1,49 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in
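The two checks suggested above (that batching happened at all, and that no elements were dropped) can be expressed as an order- and bundling-insensitive matcher. A minimal standalone sketch with illustrative names not taken from the PR — in the real test this would be passed to `assert_that` in place of `equal_to`:

```python
from collections import Counter

def make_batch_matcher(expected_elements):
    """Build a matcher that checks (1) the output consists of lists, i.e.
    batching happened at all, and (2) flattening the batches yields exactly
    the expected multiset of elements, regardless of bundle boundaries."""
    def _check(actual_batches):
        assert all(isinstance(b, list) for b in actual_batches), (
            'expected batches (lists), got %r' % (actual_batches,))
        flattened = Counter(e for batch in actual_batches for e in batch)
        assert flattened == Counter(expected_elements), (
            'element counts differ: %s vs %s' %
            (flattened, Counter(expected_elements)))
    return _check

# The same elements grouped into different batches should both pass:
make_batch_matcher(['a', 'b', 'c'])([['a', 'b'], ['c']])
make_batch_matcher(['a', 'b', 'c'])([['c'], ['b', 'a']])
```

Because it compares multisets, this matcher also covers the ordering concern raised for the `tolist_test.py` snippet.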
Re: [PR] Initial Iceberg Sink [beam]
kennknowles commented on PR #30797: URL: https://github.com/apache/beam/pull/30797#issuecomment-2037849642 > OK I have done a whole massive revision and tested it a little bit more. > > The only piece that I have not revised is the `IcebergCatalogConfig` which gets turned into an `org.apache.iceberg.catalog.Catalog` on the client and each worker separately. I think your suggestion was to try to use just a big key-value map for all the config values. I am fine with that. I don't really know enough about it yet. All my deep dives into iceberg Java libraries was for other pieces. It looks like this might work: https://github.com/tabular-io/iceberg-kafka-connect/blob/5ab5c538efab9ccf3cde166f36ba34189eed7187/kafka-connect/src/main/java/io/tabular/iceberg/connect/IcebergSinkConfig.java#L256 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Initial Iceberg Sink [beam]
kennknowles commented on PR #30797: URL: https://github.com/apache/beam/pull/30797#issuecomment-2037846183 OK, I have done a whole massive revision and tested it a little bit more. The only piece that I have not revised is the `IcebergCatalogConfig` which gets turned into an `org.apache.iceberg.catalog.Catalog` on the client and each worker separately. I think your suggestion was to try to use just a big key-value map for all the config values. I am fine with that. I don't really know enough about it yet. All my deep dives into iceberg Java libraries were for other pieces. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Initial Iceberg Sink [beam]
kennknowles commented on code in PR #30797: URL: https://github.com/apache/beam/pull/30797#discussion_r1552179077 ## sdks/java/io/iceberg/src/main/java/org/apache/beam/io/iceberg/WriteToDestinations.java: ## @@ -0,0 +1,242 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.io.iceberg; + +import java.util.Collections; +import java.util.UUID; +import org.apache.beam.io.iceberg.WriteBundlesToFiles.Result; +import org.apache.beam.sdk.Pipeline; +import org.apache.beam.sdk.coders.Coder; +import org.apache.beam.sdk.coders.IterableCoder; +import org.apache.beam.sdk.coders.KvCoder; +import org.apache.beam.sdk.coders.SerializableCoder; +import org.apache.beam.sdk.coders.ShardedKeyCoder; +import org.apache.beam.sdk.coders.StringUtf8Coder; +import org.apache.beam.sdk.transforms.Create; +import org.apache.beam.sdk.transforms.DoFn; +import org.apache.beam.sdk.transforms.Flatten; +import org.apache.beam.sdk.transforms.GroupByKey; +import org.apache.beam.sdk.transforms.MapElements; +import org.apache.beam.sdk.transforms.PTransform; +import org.apache.beam.sdk.transforms.ParDo; +import org.apache.beam.sdk.transforms.SimpleFunction; +import org.apache.beam.sdk.transforms.View; +import org.apache.beam.sdk.transforms.windowing.BoundedWindow; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.beam.sdk.values.PCollectionList; +import org.apache.beam.sdk.values.PCollectionTuple; +import org.apache.beam.sdk.values.PCollectionView; +import org.apache.beam.sdk.values.ShardedKey; +import org.apache.beam.sdk.values.TupleTag; +import org.apache.beam.sdk.values.TupleTagList; +import org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.annotations.VisibleForTesting; +import org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.collect.ImmutableList; +import org.apache.iceberg.AppendFiles; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.Snapshot; +import org.apache.iceberg.Table; +import org.apache.iceberg.catalog.TableIdentifier; + +class WriteToDestinations +extends PTransform< +PCollection>, IcebergWriteResult> { + + @VisibleForTesting static final int DEFAULT_MAX_WRITERS_PER_BUNDLE = 20; + @VisibleForTesting static final int 
DEFAULT_MAX_FILES_PER_PARTITION = 10_000; + @VisibleForTesting static final long DEFAULT_MAX_BYTES_PER_PARTITION = 10L * (1L << 40); // 10TB + static final long DEFAULT_MAX_BYTES_PER_FILE = (1L << 40); // 1TB + static final int DEFAULT_NUM_FILE_SHARDS = 0; + static final int FILE_TRIGGERING_RECORD_COUNT = 50_000; + + private final Coder destinationCoder; + + private final RecordWriterFactory recordWriterFactory; + private final TableFactory tableFactory; + + WriteToDestinations( + Coder destinationCoder, + RecordWriterFactory recordWriterFactory, + TableFactory tableFactory) { +this.destinationCoder = destinationCoder; +this.recordWriterFactory = recordWriterFactory; +this.tableFactory = tableFactory; + } + + private PCollectionView createJobIdPrefixView(Pipeline p) { + +final String jobName = p.getOptions().getJobName(); + +return p.apply("JobIdCreationRoot_", Create.of((Void) null)) +.apply( +"CreateJobId", +ParDo.of( +new DoFn() { + @ProcessElement + public void process(ProcessContext c) { +c.output(jobName + "-" + UUID.randomUUID().toString()); + } +})) +.apply("JobIdSideInput", View.asSingleton()); + } + + @Override + public IcebergWriteResult expand( + PCollection> input) { + +final PCollectionView fileView = createJobIdPrefixView(input.getPipeline()); +// We always do the equivalent of a dynamically sharded file creation +TupleTag> writtenFilesTag = new TupleTag<>("writtenFiles"); +TupleTag, ElementT>> successfulWritesTag = +new TupleTag<>("successfulWrites"); +TupleTag, ElementT>> failedWritesTag = +new TupleTag<>("failedWrites"); +TupleTag> snapshotsTag = new TupleTag<>("snapshots"); + +final Coder elementCoder = +((KvCoder)
Re: [PR] [Python] Add a couple quality-of-life improvemenets to `testing.util.assert_that` [beam]
robertwb commented on code in PR #30771: URL: https://github.com/apache/beam/pull/30771#discussion_r1552178049 ## sdks/python/apache_beam/testing/util.py: ## @@ -261,6 +261,19 @@ def assert_that( """ assert isinstance(actual, pvalue.PCollection), ( '%s is not a supported type for Beam assert' % type(actual)) + pipeline = actual.pipeline + if getattr(pipeline, 'result', None) is not None: +# The pipeline was already run. The user most likely called assert_that +# after the pipeline context. +raise RuntimeError( +'assert_that must be used within a beam.Pipeline context') + + # If label is already in use, just append a number to it. Review Comment: Maybe add something about not needing stability of names across pipeline update in a testing context? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
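The "append a number" de-duplication described in the quoted diff might look roughly like this (an illustrative sketch with hypothetical names, not the PR's actual code):

```python
def unique_label(base, existing_labels):
    """Return `base` unchanged if it is unused; otherwise append an
    increasing numeric suffix until the label no longer collides."""
    if base not in existing_labels:
        return base
    i = 2
    while '%s_%d' % (base, i) in existing_labels:
        i += 1
    return '%s_%d' % (base, i)

assert unique_label('assert_that', set()) == 'assert_that'
assert unique_label('assert_that', {'assert_that'}) == 'assert_that_2'
assert unique_label('assert_that',
                    {'assert_that', 'assert_that_2'}) == 'assert_that_3'
```

As the review comment notes, renaming like this is acceptable in tests precisely because transform-name stability across pipeline updates does not matter in a testing context.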
Re: [PR] Initial Iceberg Sink [beam]
codecov-commenter commented on PR #30797: URL: https://github.com/apache/beam/pull/30797#issuecomment-2037829820

## [Codecov](https://app.codecov.io/gh/apache/beam/pull/30797?dropdown=coverage=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) Report

All modified and coverable lines are covered by tests :white_check_mark:

> Project coverage is 0.00%. Comparing base [(`61eee6d`)](https://app.codecov.io/gh/apache/beam/commit/61eee6dd672800ed88bd0851a235e7b13ee10847?dropdown=coverage=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) to head [(`8faea3f`)](https://app.codecov.io/gh/apache/beam/pull/30797?dropdown=coverage=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache).
> Report is 1 commit behind head on master.
> :exclamation: Current head 8faea3f differs from pull request most recent head 5af12aa. Consider uploading reports for the commit 5af12aa to get more accurate results.

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##            master   #30797        +/-  ##
===========================================
- Coverage    70.74%        0    -70.75%
===========================================
  Files         1256        0      -1256
  Lines       140769        0    -140769
  Branches      4307        0      -4307
===========================================
- Hits         99592        0     -99592
+ Misses       37700        0     -37700
+ Partials      3477        0      -3477
```

| [Flag](https://app.codecov.io/gh/apache/beam/pull/30797/flags?src=pr=flags_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | Coverage Δ | |
|---|---|---|
| [go](https://app.codecov.io/gh/apache/beam/pull/30797/flags?src=pr=flag_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | `?` | |
| [java](https://app.codecov.io/gh/apache/beam/pull/30797/flags?src=pr=flag_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | `?` | |
| [python](https://app.codecov.io/gh/apache/beam/pull/30797/flags?src=pr=flag_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | `?` | |

Flags with carried forward coverage won't be shown.
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more. [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/beam/pull/30797?dropdown=coverage=pr=continue_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [Failing Test]: beam_PreCommit_Python_Coverage suite fails with ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' [beam]
tvalentyn opened a new issue, #30852: URL: https://github.com/apache/beam/issues/30852 ### What happened? The 'coverage' suite runs some Beam unit tests in environments with different versions of a particular dependency; for example, we test several versions of pyarrow or tft. The py38-tft-113 suite currently fails, likely due to incompatible dependencies in the tox environment: ``` = test session starts == Plugin: terminalreporter, Hook: pytest_sessionfinish ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' For more information see https://pluggy.readthedocs.io/en/stable/api_reference.html#pluggy.PluggyTeardownRaisedWarning config.hook.pytest_sessionfinish( Traceback (most recent call last): File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/bin/pytest", line 10, in sys.exit(console_main()) File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/_pytest/config/__init__.py", line 192, in console_main code = main() File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/_pytest/config/__init__.py", line 169, in main ret: Union[ExitCode, int] = config.hook.pytest_cmdline_main( File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/pluggy/_hooks.py", line 501, in __call__ return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult) File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/pluggy/_manager.py", line 119, in _hookexec return self._inner_hookexec(hook_name, methods, kwargs, firstresult) File
"/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/pluggy/_callers.py", line 138, in _multicall raise exception.with_traceback(exception.__traceback__) File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/pluggy/_callers.py", line 102, in _multicall res = hook_impl.function(*args) File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/_pytest/main.py", line 318, in pytest_cmdline_main return wrap_session(config, _main) File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/_pytest/main.py", line 306, in wrap_session config.hook.pytest_sessionfinish( File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/pluggy/_hooks.py", line 501, in __call__ return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult) File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/pluggy/_manager.py", line 119, in _hookexec return self._inner_hookexec(hook_name, methods, kwargs, firstresult) File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/pluggy/_callers.py", line 155, in _multicall teardown[0].send(outcome) File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/_pytest/terminal.py", line 867, in pytest_sessionfinish self.config.hook.pytest_terminal_summary( File 
"/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/pluggy/_hooks.py", line 501, in __call__ return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult) File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/pluggy/_manager.py", line 119, in _hookexec return self._inner_hookexec(hook_name, methods, kwargs, firstresult) File "/runner/_work/beam/beam/sdks/python/test-suites/tox/py38/build/srcs/sdks/python/target/.tox-py38-tft-113/py38-tft-113/lib/python3.8/site-packages/pluggy/_callers.py", line 181, in _multicall return outcome.get_result() File
Re: [I] [Failing Test]: beam_PreCommit_Python_Coverage suite fails with ModuleNotFoundError: No module named 'pydantic._hypothesis_plugin' [beam]
tvalentyn commented on issue #30852: URL: https://github.com/apache/beam/issues/30852#issuecomment-2037827468 Sample error: https://github.com/apache/beam/actions/runs/8548288262/job/23421776246?pr=30843 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [Python] Add a couple quality-of-life improvemenets to `testing.util.assert_that` [beam]
tvalentyn commented on PR #30771: URL: https://github.com/apache/beam/pull/30771#issuecomment-2037828423 I think that error is unrelated to this change, I see it on other PRs as well. Filed: https://github.com/apache/beam/issues/30852 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Allow lazy iteration for non-reiterables. [beam]
github-actions[bot] commented on PR #30851: URL: https://github.com/apache/beam/pull/30851#issuecomment-2037799521 Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Fix Jms drop record [beam]
Abacn commented on PR #30218: URL: https://github.com/apache/beam/pull/30218#issuecomment-2037783134 Yeah, it seems the authors of this IO connector were aware that prefetch in the implementation affects how messages get delivered. Here I have a proposal: introduce a "checkPointTimeout" option for JmsIO.read, defaulting to 0 (never time out). It would work like this: when more than checkPointTimeout has passed since the last advance that returned true, with the current advance still returning false, we revoke the most recently made JMS checkpoint, that is, we close the session without acknowledging the messages in it. This will cause some duplicates, but it may unblock the messages sitting in the internal buffer of the previous session. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
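A rough sketch of the proposed timeout behaviour in Python (JmsIO itself is Java; the names `CheckpointTimeoutTracker` and `on_advance` are hypothetical and only illustrate the state machine being proposed):

```python
import time

class CheckpointTimeoutTracker:
    """Tracks time since the last successful advance. `on_advance` returns
    True when the pending checkpoint should be revoked, i.e. the session
    closed without acknowledging its messages."""

    def __init__(self, checkpoint_timeout_secs, clock=time.monotonic):
        self._timeout = checkpoint_timeout_secs
        self._clock = clock
        self._last_success = clock()

    def on_advance(self, got_record):
        if got_record:
            # A successful advance resets the timer; keep the checkpoint.
            self._last_success = self._clock()
            return False
        if self._timeout == 0:
            return False  # 0 means never time out (the proposed default)
        return (self._clock() - self._last_success) > self._timeout

# Simulated clock: revoke only after 5s of advance() returning false.
now = [0.0]
tracker = CheckpointTimeoutTracker(5, clock=lambda: now[0])
assert tracker.on_advance(True) is False
now[0] = 3.0
assert tracker.on_advance(False) is False  # still within the timeout
now[0] = 6.0
assert tracker.on_advance(False) is True   # revoke the pending checkpoint
```

As noted in the comment, revoking the checkpoint trades some duplicate deliveries for releasing messages stuck in the previous session's prefetch buffer.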
Re: [PR] Fix Jms drop record [beam]
ppawel commented on PR #30218: URL: https://github.com/apache/beam/pull/30218#issuecomment-2037745694 > We have to recreate the consumer/session to properly handle checkpoints, otherwise there were data losses. This is due to the limitation that message acknowledge in Jms is per session. All messages that delivered by the time of acknowledging a message within a session will be marked as acknowledged. I've never looked so deeply into the JMS spec; to be honest, it is a bit strange if it in fact works like this with regard to acks (that it's enough to ack one single message). I think this could also be subject to testing between implementations; I can check it at some point with my case (Solace broker), but first I need to somehow deal with this deadlock situation. > Agree with the analysis and I think the cause is same here: https://github.com/apache/beam/pull/30218/files#diff-a63812b51f93708cc60430f314b496ae1110425c6a8ae4c85e59573cfb8f0938R204-R207 OK, but as I understand it, this is only for the direct runner; in a runner like Dataflow, finalizing the checkpoint might or might not happen at some point and "fix" the deadlock, but the root cause will still be there. > Is there a way to release the messages in internal buffer but do not acknowledge the receive message within the same session? I don't think it's possible at the JMS API level to do this, and even at the implementation level I don't see any easy access to those internal queues/buffers in either client. There are some parameters to control how big the buffer is, etc., but I think JmsIO should ideally work regardless of such parameters. After all, those buffers are there for a reason (performance). I see the comment in `org.apache.beam.sdk.io.jms.JmsIOTest#testCheckpointMark` regarding testing without "prefetch" - I think it would be good to also test some scenarios including prefetch and this internal buffer. I can't promise anything due to time constraints, but I might try to implement a test case that reproduces my issue.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [yaml] disable javascript mapping for python >=3.12 [beam]
tvalentyn commented on code in PR #30843: URL: https://github.com/apache/beam/pull/30843#discussion_r1552087003 ## sdks/python/setup.py: ## @@ -368,7 +368,7 @@ def get_portability_package_data(): 'grpcio>=1.33.1,!=1.48.0,<2', 'hdfs>=2.1.0,<3.0.0', 'httplib2>=0.8,<0.23.0', - 'js2py>=0.74,<1', + 'js2py>=0.74,<1; python_version<"3.12"', Review Comment: ```suggestion # https://github.com/PiotrDabkowski/Js2Py/issues/317 'js2py>=0.74,<1; python_version<"3.12"', ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
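For reference, the suggested requirement line uses a PEP 508 environment marker: `python_version<"3.12"` makes pip skip js2py entirely on 3.12+ interpreters. A rough standalone sketch of the comparison such a marker performs (simplified — real marker evaluation follows PEP 440 version ordering):

```python
def marker_allows(python_version, upper_bound='3.12'):
    """Approximate the effect of `python_version < "3.12"`: the js2py
    requirement is installed only on interpreters below the bound.
    Tuple comparison is used because a plain string compare would get
    '3.9' vs '3.12' wrong."""
    def as_tuple(version):
        return tuple(int(part) for part in version.split('.'))
    return as_tuple(python_version) < as_tuple(upper_bound)

assert marker_allows('3.8')
assert marker_allows('3.11')
assert not marker_allows('3.12')
```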
Re: [PR] Fix Jms drop record [beam]
Abacn commented on PR #30218: URL: https://github.com/apache/beam/pull/30218#issuecomment-2037690938 We have to recreate the consumer/session to properly handle checkpoints; otherwise there were data losses. This is due to the limitation that message acknowledgment in JMS is per session: all messages delivered by the time a message within a session is acknowledged will be marked as acknowledged. Agree with the analysis, and I think the cause is the same here: https://github.com/apache/beam/pull/30218/files#diff-a63812b51f93708cc60430f314b496ae1110425c6a8ae4c85e59573cfb8f0938R204-R207 The not-yet-finalized checkpoint holds an active session, which may contain messages in its internal buffer. Those messages won't get released until the checkpoint is finalized. I tested my PR with IBM MQ and it had no issue, so this is still implementation related. Is there a way to release the messages in the internal buffer without acknowledging the received messages within the same session? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [Java] ManagedIO [beam]
chamikaramj commented on code in PR #30808: URL: https://github.com/apache/beam/pull/30808#discussion_r1552055943 ## sdks/java/managed/src/main/java/org/apache/beam/sdk/managed/Managed.java: ## @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.managed; + +import com.google.auto.value.AutoValue; +import java.util.Map; +import java.util.regex.Pattern; +import javax.annotation.Nullable; +import org.apache.beam.sdk.schemas.transforms.SchemaTransform; +import org.apache.beam.sdk.values.PCollectionRowTuple; + +public class Managed { + public static Read read() { +return new AutoValue_Managed_Read.Builder().setPattern(Read.PATTERN).build(); + } + + @AutoValue + public abstract static class Read extends SchemaTransform { +protected static final Pattern PATTERN = Review Comment: Sounds good. We can explicitly list instead of trying to do a pattern matching. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [Java] ManagedIO [beam]
chamikaramj commented on code in PR #30808: URL: https://github.com/apache/beam/pull/30808#discussion_r1552052753 ## sdks/java/managed/src/main/java/org/apache/beam/sdk/managed/Managed.java: ## @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.managed; + +import com.google.auto.value.AutoValue; +import java.util.Map; +import java.util.regex.Pattern; +import javax.annotation.Nullable; +import org.apache.beam.sdk.schemas.transforms.SchemaTransform; +import org.apache.beam.sdk.values.PCollectionRowTuple; + +public class Managed { + public static Read read() { Review Comment: Robert, are you OK with keeping this as a static list for the initial PR and making it dynamic (maybe via codegen) in the future? I think a dynamic list will only be helpful if we can drop new jars with SchemaTransforms that we would like to manage into an older Beam version. At least initially, we will be very aware of the set of transforms we support (and the list will be small). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
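As a sketch of the explicit-list approach discussed above (the class name, method, and URNs here are hypothetical placeholders, not Beam's actual API), an allow-list can be as simple as a set membership check:

```java
import java.util.Set;

// Hypothetical sketch: an explicit allow-list of supported SchemaTransform
// identifiers instead of matching identifiers against a Pattern.
// The URNs below are illustrative placeholders.
class ManagedTransforms {
  private static final Set<String> SUPPORTED_READS = Set.of(
      "beam:schematransform:org.apache.beam:example_read:v1",
      "beam:schematransform:org.apache.beam:another_read:v1");

  // A membership check replaces Pattern.matcher(...).matches().
  static boolean isSupportedRead(String urn) {
    return SUPPORTED_READS.contains(urn);
  }
}
```

Making this dynamic later (e.g. via codegen) would only change how the set is populated, not the lookup.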
Re: [PR] Fix Jms drop record [beam]
ppawel commented on PR #30218: URL: https://github.com/apache/beam/pull/30218#issuecomment-2037595552

@Abacn I am using GCP Dataflow, but the test case is with DirectRunner. I am now digging deeper into this, and it is easily reproducible when you force the data source splits to 1 in `UnboundedJmsSource#split`. This means that you will have only one connection/session/consumer, and there will basically be a deadlock between the first (original) consumer and the one created when doing a checkpoint.

I temporarily removed the code that creates the new consumer, and also removed closing the consumer and session from `JmsCheckpointMark`, and now it works as expected. So I think this is most likely related to closing/opening the session/consumer.

> In the case of 50 outstanding messages, are these messages already acknowledged in the broker side (which means data loss)? Or there are still unacknowledged but do not send to other session by the broker ?

They are marked as "not acknowledged" by the broker but as "received" by one consumer. The other consumer does not get those messages. I guess the broker would send those messages to the second consumer when the first one is closed, but closing only happens when finalizing the checkpoint, and that doesn't happen because only the new consumer (which does not have any messages) is being called in `advance`. So in the end nothing is being done; it just spins around calling `receive`.
[PR] Allow lazy iteration for non-reiterables. [beam]
robertwb opened a new pull request, #30851: URL: https://github.com/apache/beam/pull/30851

In particular, Runner v2 does not produce Reiterables, which resulted in the entire stream being read into memory. In this case we can leverage the fact that the first 100MB will be cached and quick to reiterate over.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #` instead.
- [ ] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://github.com/apache/beam/blob/master/CONTRIBUTING.md#make-the-reviewers-job-easier).
To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)

GitHub Actions Tests Status (on master branch):

[![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
[![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
[![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
[![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)

See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI or the [workflows README](https://github.com/apache/beam/blob/master/.github/workflows/README.md) to see a list of phrases to trigger workflows.
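The caching idea in the PR description above can be sketched as follows. This is a simplified illustration under assumptions of my own (single-threaded use, a fixed element budget instead of a 100MB byte budget, and iteration past the cached prefix simply disallowed), not Beam's actual implementation:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Simplified sketch: wrap a one-shot (non-reiterable) iterator so that a
// bounded prefix of elements is cached, making re-iteration over that
// prefix cheap instead of reading the whole stream into memory.
class CachingIterable<T> implements Iterable<T> {
  private final Iterator<T> source;  // the underlying non-reiterable stream
  private final int maxCached;       // element budget for the cached prefix
  private final List<T> cache = new ArrayList<>();
  private boolean exhausted = false;

  CachingIterable(Iterator<T> source, int maxCached) {
    this.source = source;
    this.maxCached = maxCached;
  }

  @Override
  public Iterator<T> iterator() {
    return new Iterator<T>() {
      private int pos = 0;  // this iteration's position within the stream

      @Override
      public boolean hasNext() {
        if (pos < cache.size()) {
          return true;  // still within the cached prefix
        }
        if (exhausted) {
          return false;
        }
        if (pos >= maxCached) {
          // Past the cache budget; this sketch simply refuses to go on.
          throw new IllegalStateException("beyond cached prefix");
        }
        if (!source.hasNext()) {
          exhausted = true;
          return false;
        }
        return true;
      }

      @Override
      public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        if (pos < cache.size()) {
          return cache.get(pos++);  // replay from the cache
        }
        T value = source.next();    // pull from the stream, remember it
        cache.add(value);
        pos++;
        return value;
      }
    };
  }
}
```

A second `for` loop over the same `CachingIterable` replays the cached prefix rather than re-reading the exhausted source iterator.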
Re: [PR] add terraform for utility cluster. Add name override to gke [beam]
damondouglas commented on code in PR #30847: URL: https://github.com/apache/beam/pull/30847#discussion_r1551939749

## .test-infra/terraform/google-cloud-platform/google-kubernetes-engine/cluster.tf:

```diff
@@ -34,7 +34,9 @@ resource "google_container_cluster" "default" {
     enable_private_nodes    = true
     enable_private_endpoint = false
   }
-  node_config {
-    service_account = data.google_service_account.default.email
+  cluster_autoscaling {
```

Review Comment: Thank you for changing this! Could we also add the oauth scope for Google Cloud Platform? https://www.googleapis.com/auth/cloud-platform

## .test-infra/terraform/google-cloud-platform/google-kubernetes-engine/outputs.tf:

```diff
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+output kubernetes_api_endpoint {
+  value = google_container_cluster.default.endpoint
+}
+
+output cluster_ca_certificate {
```

Review Comment: Thank you for adding outputs :-). Could you tell me what this output is needed for?
Re: [PR] Fix Jms drop record [beam]
Abacn commented on PR #30218: URL: https://github.com/apache/beam/pull/30218#issuecomment-2037519755

Also, which runner are you using? I may have noticed this issue and noted it in https://github.com/apache/beam/pull/30218/files#diff-a63812b51f93708cc60430f314b496ae1110425c6a8ae4c85e59573cfb8f0938R204
Re: [PR] Fix Jms drop record [beam]
Abacn commented on PR #30218: URL: https://github.com/apache/beam/pull/30218#issuecomment-2037494326

Hi @ppawel, thanks for reporting this. In the case of 50 outstanding messages, are these messages already acknowledged on the broker side (which means data loss)? Or are they still unacknowledged but not sent to another session by the broker?
[I] [Bug]: ClassCastException with unknown fields in STORAGE_WRITE_API [beam]
lucasnoetzold opened a new issue, #30850: URL: https://github.com/apache/beam/issues/30850

### What happened?

We're getting a `ClassCastException` when trying to write data to a BigQuery table with the STORAGE_WRITE_API. This only happens with unknown fields on the payload, when both the `ignoreUnknownValues` and `withAutoSchemaUpdate` options are enabled and the field being written is a REPEATED RECORD.

**Conditions:**
- BigQuery table that contains a "RECORD" type field
- the record field must be in mode "repeated"
- `.ignoreUnknownValues()` option enabled
- `.withAutoSchemaUpdate(true)` option enabled
- payload that contains any unknown field inside the RECORD field that does exist

**Expected Behavior:** unknown fields are ignored and the known payload is sent to BigQuery.

---

I've managed to reproduce the error with this minimal code:

```java
var nested = List.of(Map.of("nested_field_that_exists", "something"));
var row = new TableRow().set("root_field", nested);

var pipeline = Pipeline.create();
pipeline.apply(Create.of(row))
    .apply(BigQueryIO.write()
        .ignoreUnknownValues()
        .withAutoSchemaUpdate(true)
        .withFormatFunction(identity())
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .to(new TableReference()
            .setProjectId("the_project_id")
            .setDatasetId("the_dataset_id")
            .setTableId("test_table")));
pipeline.run();
```

The BigQuery table schema used is the following:

![image](https://github.com/apache/beam/assets/27974653/491bb62f-a7d7-462f-86ee-1468a14e8220)

Looking at the stack, we can see that it happens when handling the unknown fields.
Since the unknown field is nested, and the field descriptor is inherited from the root field, it expects it to be a list (since the root field is in mode REPEATED):

![image](https://github.com/apache/beam/assets/27974653/dd0fcd24-2fe8-4760-bede-5258d7ef037e)

### Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

### Issue Components

- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
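A toy illustration of the failure mode described above (this is not Beam's code; the class and method are made up for illustration): if code driven by the inherited root descriptor treats a value as REPEATED, casting the nested unknown field's scalar value to a list throws exactly this exception.

```java
import java.util.List;

// Toy model of the described bug: the unknown nested field's value is a
// plain String, but code following the inherited (REPEATED) root field
// descriptor casts it to a List, producing a ClassCastException.
class RepeatedCastDemo {
  // Returns true if the value can be handled as a repeated (list) field.
  static boolean castAsRepeated(Object value) {
    try {
      List<?> elements = (List<?>) value;  // assumes REPEATED implies a list
      return elements != null;
    } catch (ClassCastException e) {
      return false;  // scalar value: the cast fails
    }
  }
}
```

With an actual list the cast succeeds; with the scalar `"something"` from the repro above it fails, matching the reported stack trace.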
Re: [PR] Fix Jms drop record [beam]
ppawel commented on PR #30218: URL: https://github.com/apache/beam/pull/30218#issuecomment-2037429676

@Abacn I just upgraded to Beam 2.55.0 in my project, and my streaming pipeline that uses JmsIO gets stuck while consuming messages; it was working fine with Beam 2.53.0 and earlier. I tracked this down to changes in this PR related to managing JMS resources (consumers, sessions, etc.).

I have a test case in my project that simply publishes around 60 messages to a queue, and then a Beam pipeline is executed to consume and process those messages. This test case was passing before, but now it works like this:

1. 60 messages are published to a queue in Solace (the message broker we use).
2. The pipeline is started; the first consumer is created in JmsIO and it fetches all 60 messages into the internal buffer of the consumer.
3. The Beam pipeline consumes 10 messages (`advance` is called 10 times in JmsIO).
4. 10 messages are acked, a checkpoint is made, and `recreateSession` is called in JmsIO (this is the new code introduced in this PR).
5. A new consumer is created; the old one is closed.
6. The new consumer does not get the messages anymore; the broker seems to think that the first consumer is just late with acknowledging the remaining 50 messages.
7. The pipeline hangs forever; `advance` is called all the time but brings no new messages. 50 messages remain sitting in the queue, not consumed.

Does this scenario make sense, or am I missing something in how JmsIO should work?
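The stranded-prefetch behavior in that scenario can be modeled with a toy broker (plain Java collections, not real JMS; all class and method names here are made up for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the reported behavior: the broker hands everything queued to
// a consumer's prefetch buffer at creation time, so a consumer created
// later (e.g. during checkpointing) finds nothing left to receive.
class ToyBroker {
  final Deque<String> queue = new ArrayDeque<>();

  ToyConsumer newConsumer() {
    ToyConsumer c = new ToyConsumer(this);
    c.prefetch.addAll(queue);  // broker pushes all pending messages
    queue.clear();
    return c;
  }
}

class ToyConsumer {
  final ToyBroker broker;
  final Deque<String> prefetch = new ArrayDeque<>();

  ToyConsumer(ToyBroker broker) {
    this.broker = broker;
  }

  String receive() {
    return prefetch.poll();  // null when the local buffer is empty
  }

  void close() {
    broker.queue.addAll(prefetch);  // unacked messages return to the broker
    prefetch.clear();
  }
}
```

With 60 queued messages, a first consumer that reads 10, and a second consumer created at checkpoint time, the second consumer's `receive()` returns nothing while 50 messages sit stranded in the first consumer's buffer, mirroring the described hang.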
Re: [PR] [DO NOT MERGE] Flink 1.17 [beam]
Abacn commented on PR #30197: URL: https://github.com/apache/beam/pull/30197#issuecomment-2037426140

Thanks, that's awesome!
Re: [I] [Feature Request]: Support for Apache Flink 1.18 in Beam Runners [beam]
Abacn commented on issue #30789: URL: https://github.com/apache/beam/issues/30789#issuecomment-2037424496

As for the timeline: as an open-source project, the Flink runner component has been community driven. Any contribution is welcome.
Re: [PR] [DO NOT MERGE] Flink 1.17 [beam]
je-ik commented on PR #30197: URL: https://github.com/apache/beam/pull/30197#issuecomment-2037419229

Hi, unfortunately I have not had time to finish it yet. But the target is currently release 2.56.0.
Re: [I] [Feature Request]: Support for Apache Flink 1.18 in Beam Runners [beam]
Abacn commented on issue #30789: URL: https://github.com/apache/beam/issues/30789#issuecomment-2037417183

Hi all, it looks like this is highly wanted. Could Flink runner users help test #30197 for their use cases?
Re: [PR] [DO NOT MERGE] Flink 1.17 [beam]
Abacn commented on PR #30197: URL: https://github.com/apache/beam/pull/30197#issuecomment-2037409172

Hi, as https://lists.apache.org/thread/s8o8jc2k2kb41q5g0v0xmoyszg1gdcst was resolved by #30403, is this PR ready to be merged?
Re: [PR] python sdk: fix several bugs regarding avro <-> beam schema conversion [beam]
benkonz commented on PR #30770: URL: https://github.com/apache/beam/pull/30770#issuecomment-2037391308

@ahmedabu98 could you take another look at this PR?
Re: [PR] update gcloud to 471 in runner image [beam]
damccorm merged PR #30846: URL: https://github.com/apache/beam/pull/30846