Re: Is implementing DisplayData on Beam Transforms worth?

2021-05-13 Thread Robert Bradshaw
+1, definitely think display data more often belongs to composites
than leaves. Dataflow is moving to a model where it accepts Beam protos
directly; hopefully we can get that information to the UI.

On Thu, May 13, 2021 at 4:47 PM Valentyn Tymofieiev  wrote:
>
> I also happened to look at display data associated with Beam BigQuery IOs. In
> my opinion, for IO 'display data' bits to be useful, they need to be
> visualized at the top-level (composite) transforms. BQ IOs are among the
> most complex transforms in Beam and generate very involved graphs. Display
> data information becomes too hard to find within the graph.
>
> On Wed, May 12, 2021 at 9:21 AM Reuven Lax  wrote:
>>
>> This is arguably a bug in Dataflow's backend. The backend only knows about 
>> primitive operations (ParDo, Flatten, etc.), and doesn't currently model a 
>> PTransform as an independent entity. Rather it infers the existence of the 
>> PTransform based on the naming of the operations (i.e. if you have 
>> operations named a/b and a/c, you infer a PTransform named a containing b 
>> and c). This is how the Dataflow UI knows how to display composite 
>> transforms.
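
As a toy illustration of that name-based inference (not Dataflow's actual
backend code):

    # Recover composite transforms from slash-delimited operation names.
    ops = ['a/b', 'a/c', 'd']

    composites = {}
    for op in ops:
        parts = op.split('/')
        for depth in range(1, len(parts)):
            composites.setdefault('/'.join(parts[:depth]), []).append(op)

    print(composites)  # {'a': ['a/b', 'a/c']}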
>>
>> Should Google support PTransforms as a top-level object? Yes - as you 
>> noticed this is an easy way to trip up, and sometimes innocent-seeming 
>> refactoring can cause display data to get "lost." I'm not sure what the 
>> current priority of this bug is, and it may not be fixed until things are 
>> fully on portable pipelines. For now, I suggest putting display data on 
>> primitive operations.
>>
>> Reuven
>>
>> On Wed, May 12, 2021 at 7:10 AM Ismaël Mejía  wrote:
>>>
>>> Running a pipeline on Dataflow I noticed it was not showing the 'display
>>> data' of ParquetIO on the Dataflow UI; after digging deeper I found that
>>> composite transforms are not shown on Dataflow.
>>>
>>> BEAM-366 Support Display Data on Composite Transforms
>>> https://issues.apache.org/jira/browse/BEAM-366
>>>
>>> I also noticed that for primitive transforms what is shown is not the
>>> populateDisplayData code overridden from PTransform but the
>>> populateDisplayData method implemented at the parametrizing-function
>>> level - concretely, the DoFn or Source in the case of IOs.
>>>
>>> This of course surprised me because we have been implementing all these
>>> methods in the wrong place (at the PTransform level) for years while
>>> ignoring the function level, so they are not shown in the UI. So I was wondering:
>>>
>>> 1. Does Google plan to support displaying composite transforms (BEAM-366) 
>>> at some point?
>>>
>>> 2. If (1) is not happening soon, shall we refine all our 
>>> populateDisplayData implementations to be done only at the Function level 
>>> (DoFn, Source, WindowFn)?
>>>
>>> Since Open Source runners (Flink, Spark, etc.) do not use DisplayData at all,
>>> I suppose we should keep this discussion at the Dataflow level only at this
>>> time.
>>>
>>> I don't know how this is modeled on Portable Pipelines; is DisplayData part of
>>> FunctionSpec to support the current use case? I saw that DisplayData is
>>> considered at the PTransform level, so this should cover the composite case,
>>> but I am curious whether the parametrized-function level currently in use is
>>> handled correctly for Portable pipelines.
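
For illustration, a minimal sketch of Reuven's suggestion in the Beam Python
SDK: display data attached to the DoFn (the primitive level) survives on the
ParDo operation. The transform and values here are made up for illustration,
not actual Beam IO code.

    import apache_beam as beam
    from apache_beam.transforms.display import DisplayDataItem

    class ReadRecordsFn(beam.DoFn):
        def __init__(self, table):
            self._table = table

        def display_data(self):
            # Shown on the primitive ParDo operation in the Dataflow UI.
            return {'table': DisplayDataItem(self._table, label='Input Table')}

        def process(self, element):
            yield element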
>>>


Re: Is implementing DisplayData on Beam Transforms worth?

2021-05-13 Thread Valentyn Tymofieiev
I also happened to look at display data associated with Beam BigQuery IOs.
In my opinion, for IO 'display data' bits to be useful, they need to be
visualized at the top-level (composite) transforms. BQ IOs are among the
most complex transforms in Beam and generate very involved graphs. Display
data information becomes too hard to find within the graph.

On Wed, May 12, 2021 at 9:21 AM Reuven Lax  wrote:

> This is arguably a bug in Dataflow's backend. The backend only knows about
> primitive operations (ParDo, Flatten, etc.), and doesn't currently model a
> PTransform as an independent entity. Rather it infers the existence of the
> PTransform based on the naming of the operations (i.e. if you have
> operations named a/b and a/c, you infer a PTransform named a containing b
> and c). This is how the Dataflow UI knows how to display composite
> transforms.
>
> Should Google support PTransforms as a top-level object? Yes - as you
> noticed this is an easy way to trip up, and sometimes innocent-seeming
> refactoring can cause display data to get "lost." I'm not sure what the
> current priority of this bug is, and it may not be fixed until things are
> fully on portable pipelines. For now, I suggest putting display data on
> primitive operations.
>
> Reuven
>
> On Wed, May 12, 2021 at 7:10 AM Ismaël Mejía  wrote:
>
>> Running a pipeline on Dataflow I noticed it was not showing the 'display
>> data' of ParquetIO on the Dataflow UI; after digging deeper I found that
>> composite transforms are not shown on Dataflow.
>>
>> BEAM-366 Support Display Data on Composite Transforms
>> https://issues.apache.org/jira/browse/BEAM-366
>>
>> I also noticed that for primitive transforms what is shown is not the
>> populateDisplayData code overridden from PTransform but the
>> populateDisplayData method implemented at the parametrizing-function
>> level - concretely, the DoFn or Source in the case of IOs.
>>
>> This of course surprised me because we have been implementing all these
>> methods in the wrong place (at the PTransform level) for years while
>> ignoring the function level, so they are not shown in the UI. So I was wondering:
>>
>> 1. Does Google plan to support displaying composite transforms (BEAM-366)
>> at some point?
>>
>> 2. If (1) is not happening soon, shall we refine all our
>> populateDisplayData implementations to be done only at the Function level
>> (DoFn, Source, WindowFn)?
>>
>> Since Open Source runners (Flink, Spark, etc.) do not use DisplayData at
>> all, I suppose we should keep this discussion at the Dataflow level only at
>> this time.
>>
>> I don't know how this is modeled on Portable Pipelines; is DisplayData part
>> of FunctionSpec to support the current use case? I saw that DisplayData is
>> considered at the PTransform level, so this should cover the composite case,
>> but I am curious whether the parametrized-function level currently in use is
>> handled correctly for Portable pipelines.
>>
>>


Re: Beam Dataframe - sort and grouping

2021-05-13 Thread Robert Bradshaw
Sharding is determined by the distribution of work. Each worker writes to
its own shard, and in the case of dynamic partitioning, etc. workers may
end up processing more than one "bundle" of items and hence produce more
than one shard. See also
https://beam.apache.org/documentation/runtime/model/
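
One way to pin the shard count, sketched under the assumption that the
dataframe's columns convert cleanly: drop from the deferred dataframe back to
a PCollection and use the lower-level parquet sink, which exposes num_shards.
The schema and output path here are illustrative.

    import pyarrow
    import apache_beam as beam
    from apache_beam.dataframe.convert import to_pcollection

    schema = pyarrow.schema([('key1', pyarrow.int64()),
                             ('key2', pyarrow.int64()),
                             ('feature_1', pyarrow.string())])

    pc = to_pcollection(df)  # df is the deferred dataframe from the pipeline
    _ = (pc
         | beam.Map(lambda row: row._asdict())  # named tuples to dicts
         | beam.io.WriteToParquet('gs://bucket/out', schema, num_shards=4))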

On Thu, May 13, 2021 at 3:58 PM Wenbing Bai 
wrote:

> Hi team,
>
> I have another question when using the Beam Dataframe IO connector. I tried
> to_parquet, and my data are written to several different files. I am
> wondering how I can control the number of files (shards) or how the
> sharding is done for to_parquet and other Beam Dataframe IO APIs?
>
> Thank you!
> Wenbing
>
> On Tue, May 11, 2021 at 12:20 PM Kenneth Knowles  wrote:
>
>> +dev 
>>
>> In the Beam Java ecosystem, this functionality is provided by the Sorter
>> library (
>> https://beam.apache.org/documentation/sdks/java-extensions/#sorter). I'm
>> curious what people think about various options:
>>
>>  - Python version of the transform(s)
>>  - Expose sorter as xlang transform(s)
>>  - Convenience transforms (that use pandas in DoFns?) to just do it for
>> small data per key to achieve compatibility
>>  - Beam model extension so that runners can do it as part of GBK
>>
>> Kenn
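
A rough, plain-Python sketch of the convenience-transform option above,
assuming each key's values fit in worker memory (the element shape and key
names are illustrative):

    import apache_beam as beam

    with beam.Pipeline() as p:
        keyed_rows = p | beam.Create(
            [('a', {'key2': 2}), ('a', {'key2': 1}), ('b', {'key2': 3})])
        sorted_per_key = (
            keyed_rows
            | beam.GroupByKey()
            | beam.MapTuple(
                lambda k, vals: (k, sorted(vals, key=lambda v: v['key2']))))
        _ = sorted_per_key | beam.Map(print)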
>>
>> On Mon, May 10, 2021 at 5:26 PM Wenbing Bai 
>> wrote:
>>
>>> Hi Robert and Brian,
>>>
>>> I don't know why I didn't catch your replies. But thank you so much for
>>> looking at this.
>>>
>>> My parquet files will be consumed by downstream processes which
>>> require data points with the same "key1" to be sorted by "key2". A
>>> downstream process, for example, will make a rolling window of size N
>>> that reads N records together at one time. Note, though, that the rolling
>>> window will not cross different "key1" values.
>>>
>>> So that is to say: 1) I don't need to sort the whole dataset; 2) all
>>> data with the same "key1" should be located together.
>>>
>>> I am not sure if I explain the use case clearly. Let me know what you
>>> think.
>>>
>>> Wenbing
>>>
>>>
>>> On Tue, Apr 20, 2021 at 5:01 PM Robert Bradshaw 
>>> wrote:
>>>
 It would also be helpful to understand what your overall objective is
 with this output. Is there a reason you need it sorted/partitioned in a
 certain way?

 On Tue, Apr 20, 2021 at 4:51 PM Brian Hulette 
 wrote:

> Hi Wenbing,
> Sorry for taking so long to get back to you on this.
> I discussed this with Robert offline and we came up with a potential
> workaround - you could try writing out the Parquet file from within the
> groupby.apply method. You can use Beam's FileSystems abstraction to open a
> Python file object referencing a cloud storage file, and pass that file
> object directly to the pandas to_parquet. It would look something like
> this:
>
>   df.groupby('key1').apply(lambda df:
>       df.sort_values(by='key2').to_parquet(FileSystems.create("gs://bucket/file.pq")))
>
> If writing out sorted, partitioned parquet files is a common use-case
> we should think about making this easier though. At the very least
> partition_cols should work, I filed BEAM-12201 [1] for this. That alone
> won't be enough as our implementation will likely reshuffle the dataset to
> enforce the partitioning, removing any sorting that you've applied, so we'd
> also need to think about how to optimize the pipeline to avoid that shuffle.
>
> Brian
>
> [1] https://issues.apache.org/jira/browse/BEAM-12201
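
For the related question quoted below about writing each group to its own
file, an untested variation of this workaround could derive the output path
from the group key (the bucket path is illustrative; FileSystems.create is
the write-side counterpart of FileSystems.open):

    from apache_beam.io.filesystems import FileSystems

    def write_group(group):
        key = group['key1'].iloc[0]  # the grouping column is still present
        f = FileSystems.create('gs://bucket/out/key1=%s.pq' % key)
        try:
            group.sort_values(by='key2').to_parquet(f)
        finally:
            f.close()

    df.groupby('key1').apply(write_group)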
>
> On Wed, Apr 7, 2021 at 9:02 PM Wenbing Bai 
> wrote:
>
>> Thank you, Brian. I tried `partition_cols`, but it is not working. I
>> tried pure pandas and it does work, so I am not sure if something is
>> wrong with Beam.
>>
>> Wenbing
>>
>> On Wed, Apr 7, 2021 at 2:56 PM Brian Hulette 
>> wrote:
>>
>>> Hm, to_parquet does have a `partition_cols` argument [1] which we
>>> pass through [2]. It would be interesting to see what
>>> `partition_cols='key1'` does - I suspect it won't work perfectly though.
>>>
>>> Do you have any thoughts here Robert?
>>>
>>> [1]
>>> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html
>>> [2]
>>> https://github.com/apache/beam/blob/a8cd05932bed9b2480316fb8518409636cb2733b/sdks/python/apache_beam/dataframe/io.py#L525
>>>
>>> On Wed, Apr 7, 2021 at 2:22 PM Wenbing Bai <
>>> wenbing@getcruise.com> wrote:
>>>
 Hi Robert and Brian,

 I tried groupby in my case. Here is my pipeline code. I do see that all
 the data in the final parquet file are sorted within each group. However,
 I'd like to write each partition (group) to an individual file; how can I
 achieve that? In addition, I am using the master branch of the Apache Beam
 SDK; how can I test the pipeline with DataflowRunner considering there is
 no dataflow worker image available?

Re: Beam Dataframe - sort and grouping

2021-05-13 Thread Wenbing Bai
Hi team,

I have another question when using the Beam Dataframe IO connector. I tried
to_parquet, and my data are written to several different files. I am
wondering how I can control the number of files (shards) or how the
sharding is done for to_parquet and other Beam Dataframe IO APIs?

Thank you!
Wenbing

On Tue, May 11, 2021 at 12:20 PM Kenneth Knowles  wrote:

> +dev 
>
> In the Beam Java ecosystem, this functionality is provided by the Sorter
> library (
> https://beam.apache.org/documentation/sdks/java-extensions/#sorter). I'm
> curious what people think about various options:
>
>  - Python version of the transform(s)
>  - Expose sorter as xlang transform(s)
>  - Convenience transforms (that use pandas in DoFns?) to just do it for
> small data per key to achieve compatibility
>  - Beam model extension so that runners can do it as part of GBK
>
> Kenn
>
> On Mon, May 10, 2021 at 5:26 PM Wenbing Bai 
> wrote:
>
>> Hi Robert and Brian,
>>
>> I don't know why I didn't catch your replies. But thank you so much for
>> looking at this.
>>
>> My parquet files will be consumed by downstream processes which
>> require data points with the same "key1" to be sorted by "key2". A
>> downstream process, for example, will make a rolling window of size N
>> that reads N records together at one time. Note, though, that the rolling
>> window will not cross different "key1" values.
>>
>> So that is to say: 1) I don't need to sort the whole dataset; 2) all data
>> with the same "key1" should be located together.
>>
>> I am not sure if I explain the use case clearly. Let me know what you
>> think.
>>
>> Wenbing
>>
>>
>> On Tue, Apr 20, 2021 at 5:01 PM Robert Bradshaw 
>> wrote:
>>
>>> It would also be helpful to understand what your overall objective is
>>> with this output. Is there a reason you need it sorted/partitioned in a
>>> certain way?
>>>
>>> On Tue, Apr 20, 2021 at 4:51 PM Brian Hulette 
>>> wrote:
>>>
 Hi Wenbing,
 Sorry for taking so long to get back to you on this.
 I discussed this with Robert offline and we came up with a potential
 workaround - you could try writing out the Parquet file from within the
 groupby.apply method. You can use Beam's FileSystems abstraction to open a
 Python file object referencing a cloud storage file, and pass that file
 object directly to the pandas to_parquet. It would look something like 
 this:

   df.groupby('key1').apply(lambda df:
       df.sort_values(by='key2').to_parquet(FileSystems.create("gs://bucket/file.pq")))

 If writing out sorted, partitioned parquet files is a common use-case
 we should think about making this easier though. At the very least
 partition_cols should work, I filed BEAM-12201 [1] for this. That alone
 won't be enough as our implementation will likely reshuffle the dataset to
 enforce the partitioning, removing any sorting that you've applied, so we'd
 also need to think about how to optimize the pipeline to avoid that shuffle.

 Brian

 [1] https://issues.apache.org/jira/browse/BEAM-12201

 On Wed, Apr 7, 2021 at 9:02 PM Wenbing Bai 
 wrote:

> Thank you, Brian. I tried `partition_cols`, but it is not working. I
> tried pure pandas and it does work, so I am not sure if something is
> wrong with Beam.
>
> Wenbing
>
> On Wed, Apr 7, 2021 at 2:56 PM Brian Hulette 
> wrote:
>
>> Hm, to_parquet does have a `partition_cols` argument [1] which we
>> pass through [2]. It would be interesting to see what
>> `partition_cols='key1'` does - I suspect it won't work perfectly though.
>>
>> Do you have any thoughts here Robert?
>>
>> [1]
>> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html
>> [2]
>> https://github.com/apache/beam/blob/a8cd05932bed9b2480316fb8518409636cb2733b/sdks/python/apache_beam/dataframe/io.py#L525
>>
>> On Wed, Apr 7, 2021 at 2:22 PM Wenbing Bai 
>> wrote:
>>
>>> Hi Robert and Brian,
>>>
>>> I tried groupby in my case. Here is my pipeline code. I do see that all
>>> the data in the final parquet file are sorted within each group. However,
>>> I'd like to write each partition (group) to an individual file; how can I
>>> achieve that? In addition, I am using the master branch of the Apache Beam
>>> SDK; how can I test the pipeline with DataflowRunner considering there is
>>> no dataflow worker image available?
>>>
>>> data = [
>>>     {
>>>         "key1": 1000 + i % 10,
>>>         "key2": randrange(1),
>>>         "feature_1": "somestring{}".format(i)
>>>     } for i in range(1)
>>> ]
>>>
>>> class TestRow(typing.NamedTuple):
>>>     key1: int
>>>     key2: int
>>>     feature_1: str
>>>
>>> with beam.Pipeline() as p:
>>>     pcoll = (
>>>         p
>>>         | beam.Create(data)
>>>         | beam.Map(lambda x: x).with_output_types(TestRow)
>>>     )
>>>
>>>     df = 

Flaky test issue report

2021-05-13 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests. These are P1 issues 
because they have a major negative impact on the community and make it hard to 
determine the quality of the software.

BEAM-12322: FnApiRunnerTestWithGrpcAndMultiWorkers flaky (py precommit) 
(https://issues.apache.org/jira/browse/BEAM-12322)
BEAM-12311: Python PostCommit are close to timeout 
(https://issues.apache.org/jira/browse/BEAM-12311)
BEAM-12309: PubSubIntegrationTest.test_streaming_data_only flake 
(https://issues.apache.org/jira/browse/BEAM-12309)
BEAM-12307: PubSubBigQueryIT.test_file_loads flake 
(https://issues.apache.org/jira/browse/BEAM-12307)
BEAM-12303: Flake in PubSubIntegrationTest.test_streaming_with_attributes 
(https://issues.apache.org/jira/browse/BEAM-12303)
BEAM-12293: FlinkSavepointTest.testSavepointRestoreLegacy flakes due to 
FlinkJobNotFoundException (https://issues.apache.org/jira/browse/BEAM-12293)
BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (https://issues.apache.org/jira/browse/BEAM-12291)
BEAM-12200: SamzaStoreStateInternalsTest is flaky 
(https://issues.apache.org/jira/browse/BEAM-12200)
BEAM-12163: Python GHA PreCommits flake with grpc.FutureTimeoutError on SDK 
harness startup (https://issues.apache.org/jira/browse/BEAM-12163)
BEAM-12061: beam_PostCommit_SQL failing on 
KafkaTableProviderIT.testFakeNested 
(https://issues.apache.org/jira/browse/BEAM-12061)
BEAM-12019: 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky (https://issues.apache.org/jira/browse/BEAM-12019)
BEAM-11792: Python precommit failed (flaked?) installing package  
(https://issues.apache.org/jira/browse/BEAM-11792)
BEAM-11666: 
apache_beam.runners.interactive.recording_manager_test.RecordingManagerTest.test_basic_execution
 is flaky (https://issues.apache.org/jira/browse/BEAM-11666)
BEAM-11662: elasticsearch tests failing 
(https://issues.apache.org/jira/browse/BEAM-11662)
BEAM-11661: hdfsIntegrationTest flake: network not found (py38 postcommit) 
(https://issues.apache.org/jira/browse/BEAM-11661)
BEAM-11646: beam_PostCommit_XVR_Spark failing 
(https://issues.apache.org/jira/browse/BEAM-11646)
BEAM-11645: beam_PostCommit_XVR_Flink failing 
(https://issues.apache.org/jira/browse/BEAM-11645)
BEAM-11541: testTeardownCalledAfterExceptionInProcessElement flakes on 
direct runner. (https://issues.apache.org/jira/browse/BEAM-11541)
BEAM-11540: Linter sometimes flakes on apache_beam.dataframe.frames_test 
(https://issues.apache.org/jira/browse/BEAM-11540)
BEAM-10995: Java + Universal Local Runner: 
WindowingTest.testWindowPreservation fails 
(https://issues.apache.org/jira/browse/BEAM-10995)
BEAM-10987: stager_test.py::StagerTest::test_with_main_session flaky on 
windows py3.6,3.7 (https://issues.apache.org/jira/browse/BEAM-10987)
BEAM-10968: flaky test: 
org.apache.beam.sdk.metrics.MetricsTest$AttemptedMetricTests.testAttemptedDistributionMetrics
 (https://issues.apache.org/jira/browse/BEAM-10968)
BEAM-10955: Flink Java Runner test flake: Could not find Flink job  
(https://issues.apache.org/jira/browse/BEAM-10955)
BEAM-10923: Python requirements installation in docker container is flaky 
(https://issues.apache.org/jira/browse/BEAM-10923)
BEAM-10899: test_FhirIO_exportFhirResourcesGcs flake with OOM 
(https://issues.apache.org/jira/browse/BEAM-10899)
BEAM-10866: PortableRunnerTestWithSubprocesses.test_register_finalizations 
flaky on macOS (https://issues.apache.org/jira/browse/BEAM-10866)
BEAM-10590: BigQueryQueryToTableIT flaky: test_big_query_new_types 
(https://issues.apache.org/jira/browse/BEAM-10590)
BEAM-10519: 
MultipleInputsAndOutputTests.testParDoWithSideInputsIsCumulative flaky on Samza 
(https://issues.apache.org/jira/browse/BEAM-10519)
BEAM-10504: Failure / flake in ElasticSearchIOTest > 
testWriteFullAddressing and testWriteWithIndexFn 
(https://issues.apache.org/jira/browse/BEAM-10504)
BEAM-10501: CheckGrafanaStalenessAlerts and PingGrafanaHttpApi fail with 
Connection refused (https://issues.apache.org/jira/browse/BEAM-10501)
BEAM-10485: Failure / flake: ElasticsearchIOTest > testWriteWithIndexFn 
(https://issues.apache.org/jira/browse/BEAM-10485)
BEAM-10272: Failure in CassandraIOTest init: cannot create cluster due to 
netty link error (https://issues.apache.org/jira/browse/BEAM-10272)
BEAM-9649: beam_python_mongoio_load_test started failing due to mismatched 
results (https://issues.apache.org/jira/browse/BEAM-9649)
BEAM-9392: TestStream tests are all flaky 
(https://issues.apache.org/jira/browse/BEAM-9392)
BEAM-9232: BigQueryWriteIntegrationTests is flaky coercing to Unicode 
(https://issues.apache.org/jira/browse/BEAM-9232)
BEAM-9119: 
apache_beam.runners.portability.fn_api_runner_test.FnApiRunnerTest[...].test_large_elements
 is flaky 

P1 issues report

2021-05-13 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests.

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

BEAM-12324: TranslationsTest.test_run_packable_combine_* failing on 
PostCommit_Py_VR_Dataflow (https://issues.apache.org/jira/browse/BEAM-12324)
BEAM-12320: PubsubTableProviderIT.testSQLSelectsArrayAttributes[0] failing 
in SQL PostCommit (https://issues.apache.org/jira/browse/BEAM-12320)
BEAM-12316: LGPL in bundled dependencies 
(https://issues.apache.org/jira/browse/BEAM-12316)
BEAM-12310: beam_PostCommit_Java_DataflowV2 failing 
(https://issues.apache.org/jira/browse/BEAM-12310)
BEAM-12308: CrossLanguageKafkaIOTest.test_kafkaio flake 
(https://issues.apache.org/jira/browse/BEAM-12308)
BEAM-12290: TestPubsub.assertThatSubscriptionEventuallyCreated timeout does 
not work (https://issues.apache.org/jira/browse/BEAM-12290)
BEAM-12287: beam_PerformanceTests_Kafka_IO failing due to 
:sdks:java:container:pullLicenses failure 
(https://issues.apache.org/jira/browse/BEAM-12287)
BEAM-12279: Implement destination-dependent sharding in FileIO.writeDynamic 
(https://issues.apache.org/jira/browse/BEAM-12279)
BEAM-12258: SQL postcommit timing out 
(https://issues.apache.org/jira/browse/BEAM-12258)
BEAM-12256: PubsubIO.readAvroGenericRecord creates SchemaCoder that fails 
to decode some Avro logical types 
(https://issues.apache.org/jira/browse/BEAM-12256)
BEAM-12231: beam_PostRelease_NightlySnapshot failing 
(https://issues.apache.org/jira/browse/BEAM-12231)
BEAM-11959: Python Beam SDK Harness hangs when installing pip packages 
(https://issues.apache.org/jira/browse/BEAM-11959)
BEAM-11906: No trigger early repeatedly for session windows 
(https://issues.apache.org/jira/browse/BEAM-11906)
BEAM-11875: XmlIO.Read does not handle XML encoding per spec 
(https://issues.apache.org/jira/browse/BEAM-11875)
BEAM-11828: JmsIO is not acknowledging messages correctly 
(https://issues.apache.org/jira/browse/BEAM-11828)
BEAM-11755: Cross-language consistency (RequiresStableInputs) is quietly 
broken (at least on portable flink runner) 
(https://issues.apache.org/jira/browse/BEAM-11755)
BEAM-11578: `dataflow_metrics` (python) fails with TypeError (when int 
overflowing?) (https://issues.apache.org/jira/browse/BEAM-11578)
BEAM-11576: Go ValidatesRunner failure: TestFlattenDup on Dataflow Runner 
(https://issues.apache.org/jira/browse/BEAM-11576)
BEAM-11434: Expose Spanner admin/batch clients in Spanner Accessor 
(https://issues.apache.org/jira/browse/BEAM-11434)
BEAM-11227: Upgrade beam-vendor-grpc-1_26_0-0.3 to fix CVE-2020-27216 
(https://issues.apache.org/jira/browse/BEAM-11227)
BEAM-11148: Kafka commitOffsetsInFinalize OOM on Flink 
(https://issues.apache.org/jira/browse/BEAM-11148)
BEAM-11017: Timer with dataflow runner can be set multiple times (dataflow 
runner) (https://issues.apache.org/jira/browse/BEAM-11017)
BEAM-10861: Adds URNs and payloads to PubSub transforms 
(https://issues.apache.org/jira/browse/BEAM-10861)
BEAM-10670: Make non-portable Splittable DoFn the only option when 
executing Java "Read" transforms 
(https://issues.apache.org/jira/browse/BEAM-10670)
BEAM-10617: python CombineGlobally().with_fanout() cause duplicate combine 
results for sliding windows (https://issues.apache.org/jira/browse/BEAM-10617)
BEAM-10569: SpannerIO tests don't actually assert anything. 
(https://issues.apache.org/jira/browse/BEAM-10569)
BEAM-10288: Quickstart documents are out of date 
(https://issues.apache.org/jira/browse/BEAM-10288)
BEAM-10244: Populate requirements cache fails on poetry-based packages 
(https://issues.apache.org/jira/browse/BEAM-10244)
BEAM-10100: FileIO writeDynamic with AvroIO.sink not writing all data 
(https://issues.apache.org/jira/browse/BEAM-10100)
BEAM-9564: Remove insecure ssl options from MongoDBIO 
(https://issues.apache.org/jira/browse/BEAM-9564)
BEAM-9455: Environment-sensitive provisioning for Dataflow 
(https://issues.apache.org/jira/browse/BEAM-9455)
BEAM-9293: Python direct runner doesn't emit empty pane when it should 
(https://issues.apache.org/jira/browse/BEAM-9293)
BEAM-8986: SortValues may not work correct for numerical types 
(https://issues.apache.org/jira/browse/BEAM-8986)
BEAM-8985: SortValues should fail if SecondaryKey coder is not 
deterministic (https://issues.apache.org/jira/browse/BEAM-8985)
BEAM-8407: [SQL] Some Hive tests throw NullPointerException, but get marked 
as passing (Direct Runner) (https://issues.apache.org/jira/browse/BEAM-8407)
BEAM-7717: PubsubIO watermark tracking hovers near start of epoch 
(https://issues.apache.org/jira/browse/BEAM-7717)
BEAM-7716: PubsubIO returns empty message bodies for all messages read 
(https://issues.apache.org/jira/browse/BEAM-7716)
BEAM-7195: BigQuery - 404 errors for 'table not 

Re: [PROPOSAL] Preparing for Beam 2.30.0 release

2021-05-13 Thread Heejong Lee
UPDATE:

All precommit and postcommit tests are passing now:
https://github.com/apache/beam/pull/14711

We only have one open issue for Fix Version 2.30.0:
https://issues.apache.org/jira/browse/BEAM-12242?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.30.0

I will start building the RC release after we cherry-pick the last blocker.

On Thu, Apr 29, 2021 at 12:48 AM Heejong Lee  wrote:

> We have 10 open issues for Fix Version 2.30.0:
> https://issues.apache.org/jira/browse/BEAM-12242?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.30.0
>
> On Thu, Apr 29, 2021 at 12:30 AM Heejong Lee  wrote:
>
>> FYI, I just cut the 2.30.0 release branch. From now on, late commits for
>> 2.30.0 need to be cherry-picked. If you have any late commits, please make
>> sure that their Jira issues have the correct Fix Version, 2.30.0.
>>
>> On Tue, Apr 27, 2021 at 7:52 AM Kenneth Knowles  wrote:
>>
>>> SGTM. Thanks!
>>>
>>> On Mon, Apr 26, 2021 at 2:33 PM Heejong Lee  wrote:
>>>


 On Mon, Apr 26, 2021 at 10:24 AM Robert Bradshaw 
 wrote:

> Confirming that the cut date is 4/28/2021 (in two days), right?
>

 Yes, 2.30.0 branch is scheduled to be cut on April 28.


>
> On Wed, Apr 21, 2021 at 4:41 PM Tomo Suzuki 
> wrote:
> >
> > Thank you for the preparation!
> >
> > > a few responses that some high priority changes
> >
> > Would you be willing to share the items for visibility?
>
> There are several PRs in flight (or recently merged) to get
> portability working well with Dataflow for this release.
>

 We can still cherry-pick them by importance after the branch cut.


>
> >
> > On Wed, Apr 21, 2021 at 7:21 PM Kenneth Knowles 
> wrote:
> > >
> > > Also, the 2.29.0 branch was re-cut.
> > >
> > > Usually a delay in one release should not delay the next release,
> because each release represents a certain quantity of changes. But in this
> case, the actual quantity of changes is affected by the re-cut, too.
> > >
> > > On Wed, Apr 21, 2021 at 4:12 PM Heejong Lee 
> wrote:
> > >>
> > >> Update on the 2.30.0 branch cut schedule:
> > >>
> > >> I'm thinking of delaying the branch cut a week since I've got a
> few responses that some high priority changes are still ongoing.
> > >>
> > >> The new cut date is April 28.
> > >>
> > >>
> > >> On Tue, Apr 20, 2021 at 6:07 PM Ahmet Altay 
> wrote:
> > >>>
> > >>> +1 and thank you!
> > >>>
> > >>> On Tue, Apr 20, 2021 at 4:55 PM Heejong Lee 
> wrote:
> > 
> >  Hi All,
> > 
> >  Beam 2.30.0 release is scheduled to be cut on April 21
> according to the release calendar [1]
> > 
> >  I'd like to volunteer myself to be the release manager for this
> release. I plan on cutting the release branch on the scheduled date.
> > 
> >  Any comments or objections ?
> > 
> >  Thanks,
> >  Heejong
> > 
> >  [1]
> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com=America/Los_Angeles
> >
> >
> >
> > --
> > Regards,
> > Tomo
>



Re: Please help triage issues!

2021-05-13 Thread Kyle Weaver
It's a little cumbersome, but you can query JIRA and export a CSV with the
labels, and run a script to count them. Also, JIRA won't let you export
results from a query with more than 1000 results.
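
The counting step itself can be a few lines of Python - a sketch that assumes
the export repeats a "Labels" column once per label, which is JIRA's usual
CSV convention (the filename is illustrative):

    import csv
    from collections import Counter

    counts = Counter()
    with open('beam_issues.csv', newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        label_cols = [i for i, h in enumerate(header) if h == 'Labels']
        for row in reader:
            for i in label_cols:
                if i < len(row) and row[i]:
                    counts[row[i]] += 1

    for label, n in counts.most_common():
        print(label, n)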

Here's the list from query "project = beam and created > startOfYear()"

dataframe-api 45
stale-P2 36
currently-failing 31
stale-assigned 31
website-revamp-2020 28
flake 27
zetasql-java-udf 17
portability-spark 6
portability-flink 4
test-failure 4
starter 4
MongoDB 3
Python 3
PubSubIO 3
GCP 3
pipeline-patterns 3
newbie 2
python 2
PubsubLiteIO 2
beam-fixit 2
vulnerabilities 1
documentation 1
containers 1
types 1
mongo 1
mongodb 1
elasticsearch 1
dataflow 1
java 1
Grouping 1
Windowing 1
Doc 1
Learning 1
ClassNotFoundException 1
jdbc 1
gcp 1
pubsub 1
pubsubio 1
apache-beam 1
ClassCastException 1
JdbcIO 1
MySQL 1
easyfix 1
pull-request-available 1
gsoc 1
gsoc2021 1
mentor 1
python-sqltransform 1
OOM 1
AWS 1
multi-threading 1
S3 1
log4j 1
log-aggregation 1
"logging 1
" 1
SLF4J 1
google-cloud-spanner 1
kafka 1
savepoints 1
flaky-test 1
website-revamp-sprint-12 1
structured-streaming 1
nexmark 1


On Wed, May 12, 2021 at 3:10 PM Valentyn Tymofieiev 
wrote:

> Is there a way to see the list of labels used in Beam? I found a
> discussion on using a labels gadget and some SQL queries to pull the
> labels [1], but did not find a way to use them - does anyone have hands-on
> experience with any of these approaches? Does adding a gadget require PMC
> privileges?
>
> Thanks!
>
> [1]
> https://community.atlassian.com/t5/Jira-questions/Is-there-a-way-to-get-a-list-of-all-labels-being-used-in-a/qaq-p/344778
>
> On Mon, Mar 29, 2021 at 10:59 AM Kenneth Knowles  wrote:
>
>> We are down to about 550.
>>
>> I randomly selected some long-time contributors who I am sure know about
>> components and priorities well enough. There are 10-15 issues across a
>> number of people. If these are already good, then it would close out a lot
>> of them and help focus on the ones that need attention.
>>
>> This Jira search searches by "current user" so you should see the bugs
>> that you have reported that are still marked as "Triage Needed". Take a
>> quick look and if you are confident you got the components, priority,
>> labels (especially "currently-failing" and "flake") then you could bulk
>> edit them to "Open" status:
>>
>>
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20%22Triage%20Needed%22%20AND%20reporter%20in%20(currentUser())
>>
>> Kenn
>>
>> On Mon, Mar 15, 2021 at 10:28 AM Tyson Hamilton 
>> wrote:
>>
>>> There is a 'Triaged' button that I click:
>>> https://photos.app.goo.gl/Ub5Qwnpp6aFrmaDZ9
>>>
>>> On Mon, Mar 15, 2021 at 9:48 AM Alex Amato  wrote:
>>>
 (Do I need certain permissions to be able to do this?)

 On Mon, Mar 15, 2021 at 9:47 AM Alex Amato  wrote:

> Would you mind posting a screenshot of exactly where you are supposed
> to click to move a Jira issue to "Open" status? I honestly can't find where
> to click. I don't see the option in the edit dialog box.
>
> On Sun, Mar 14, 2021 at 8:03 PM Kenneth Knowles 
> wrote:
>
>> No need for feeling any guilt :-)
>>
>> I'm just hoping that by everyone randomly doing a very small amount
>> of work, this could be in good shape very quickly. I've done a number of
>> bulk edits like automated dependency upgrade requests which brings the
>> number down to just over 600.
>>
>> Your message does highlight some easy cases: issues filed to track
>> your own feature work. I did build automation for this: "On Issue Created"
>> -> "If Assignee == Issue Creator" -> "Transition to 'Open'". If the
>> automation isn't working, that can probably be fixed. Some of the issues
>> might just predate the automation.
>>
>> To be super clear: I don't mean to ask anyone to waste time looking
>> at things that don't need attention, but to be able to notice things that
>> do need attention. I did a few manually too, and the components, issue
>> type, and priority very often need fixing up. I especially want to get
>> untriaged P0s and P1s to zero.
>>
>> Kenn
>>
>> On Fri, Mar 12, 2021 at 5:07 PM Tyson Hamilton 
>> wrote:
>>
>>> I'm guilty of creating issues and not moving them to 'open'. I'll do
>>> better to move them to open in the future. To recompense I will spend 
>>> some
>>> additional time triaging =)
>>>
>>> Thanks for the review of the flow.
>>>
>>> On Thu, Mar 11, 2021 at 12:39 PM Kenneth Knowles 
>>> wrote:
>>>
 Hi all,

 You may or may not think about this very often, but our Jira
 workflow goes like this:

 Needs Triage --> Open --> In Progress --> Resolved

 "Needs Triage" means someone needs to look at it briefly:

  - component(s)
  - label(s)
  - issue type

Re: Upgrading vendored gRPC from 1.26.0 to 1.36.0

2021-05-13 Thread Tomo Suzuki
I'm giving up! Can anyone troubleshoot this gRPC concurrency problem
further?
My current view of the problem (link) is that "grpc-default-executor"
threads stop processing the data. But I cannot tell why.

I also raised a question to grpc-java on how best to troubleshoot such a
situation:
https://github.com/grpc/grpc-java/issues/8174

On Wed, May 12, 2021 at 11:29 PM Tomo Suzuki  wrote:

> Update: the root cause is still unknown.
>
> From my observation with debug logging and thread dump,
> "grpc-default-executor-XXX" threads disappear when the problematic tests
> become hung.
> More notes:
> https://github.com/apache/beam/pull/14768#issuecomment-840228795
>
> Interestingly the "grpc-default-executor-XXX" threads reappear in the logs
> when the pause triggers a 5-second timeout set by JUnit.
>
>
> On Tue, May 11, 2021 at 1:12 PM Tomo Suzuki  wrote:
>
>> Thank you for the advice. Yes, the latch not being counted down is the
>> problem. (my memo:
>> https://github.com/apache/beam/pull/14474#discussion_r619557479 ) I'll
>> need to figure out why withOnError is not called.
>>
>>
>> > Can you repro locally?
>>
>> No, the task succeeds in my environment (./gradlew
>> :runners:google-cloud-dataflow-java:worker:test).
>>
>>
>> On Tue, May 11, 2021 at 12:34 PM Kenneth Knowles  wrote:
>>
>>> I am not sure how much you read the code of the test. So apologies if I
>>> am saying things you already know. The test does something like:
>>>
>>>  - start a logging service
>>>  - set up some stub clients, each with onError wired up to release a
>>> countdown latch
>>>  - send error responses to all three of them (actually it sends the
>>> error in the same task that creates the stub)
>>>  - each task waits on the latch
>>>
>>> So if onError does not deliver or does not call to release the countdown
>>> latch, it will hang. I notice in the gist you provide that all three stub
>>> clients are hung awaiting the latch. That is suspicious to me. I would want
>>> to confirm if the flakiness always occurs in a way that hangs all three.
>>> Then there are gRPC workers waiting on empty queues, and the main test
>>> thread waiting for the hung tasks to complete.
>>>
>>> The problem could be something about the test set up. Personally I would
>>> add a ton of logs, or potentially use a debugger, to confirm exactly the
>>> state of things when it hangs. Can you repro locally? I think this same
>>> functionality could be tested in different ways that might remove some of
>>> the variables. For example starting up all the waiting tasks, then sending
>>> all the onError messages that should cause them to terminate.
>>>
>>> Since this is a unit test, adding a timeout to just that method should
>>> save time (but will make it harder to capture stack traces, etc). I've
>>> opened up https://github.com/apache/beam/pull/14781 for that. There may
>>> be a nice way to add a timeout to the executor to capture the hung stack,
>>> but I didn't look for it.
>>>
>>> Kenn
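
The latch structure described above can be sketched in Python purely for
illustration (the real test, BeamFnLoggingServiceTest, is Java/JUnit): each
client task blocks on a latch that only its onError callback releases, so a
dropped onError delivery hangs the task until a timeout.

    import threading
    from concurrent.futures import ThreadPoolExecutor

    def client_task(deliver_error):
        latch = threading.Event()

        def on_error(exc):  # wired up as the stub's error callback
            latch.set()     # releases the waiting task

        if deliver_error:   # stands in for the server sending an error
            on_error(RuntimeError('boom'))
        # If on_error is never invoked, this blocks until the timeout.
        return latch.wait(timeout=5)

    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(client_task, True) for _ in range(3)]
        print([f.result() for f in futures])  # [True, True, True]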
>>>
>>> On Tue, May 11, 2021 at 7:36 AM Tomo Suzuki  wrote:
>>>
 gRPC 1.37.0 showed the same problem:
 BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 waits on its tasks forever, causing a timeout in the Java precommit.

 While I continue my investigation, I would appreciate it if someone knows
 the cause of the problem. I pasted the thread dump of the Java process when
 the test was frozen:
 https://github.com/apache/beam/pull/14768

 If this mystery is never solved, vendoring the (slightly older) gRPC 1.32.2
 without the jboss dependencies is an alternate option (suggestion by Kenn;
 memo)

 Regards,
 Tomo


 On Mon, May 10, 2021 at 9:40 AM Tomo Suzuki  wrote:

> I was investigating the strange timeout (
> https://github.com/apache/beam/pull/14474) but was occupied with
> something else lately.
> Let me try the new version today to see any improvements.
>
>
> On Mon, May 10, 2021 at 4:57 AM Ismaël Mejía 
> wrote:
>
>> I just saw that gRPC 1.37.1 is out now (and with aarch64 support for
>> Python!), which made me wonder about this: what is the current status of
>> upgrading the vendored dependency, Tomo?
>>
>>
>> On Thu, Apr 8, 2021 at 4:16 PM Tomo Suzuki 
>> wrote:
>>
>>> We observed that the Java Precommit cron job for the master branch
>>> started timing out often (not always) after upgrading the gRPC version.
>>> https://github.com/apache/beam/pull/14466#issuecomment-815343974
>>>
>>> After exchanging messages with Kenn, I reverted the change; now the
>>> master branch uses the vendored gRPC 1.26.
>>>
>>>
>>> On Wed, Mar 31, 2021 at 11:40 AM Kenneth Knowles 
>>> wrote: