Flaky test issue report (29)

2021-07-07 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20labels%20%3D%20flake)

These are P1 issues because they have a major negative impact on the community 
and make it hard to determine the quality of the software.

https://issues.apache.org/jira/browse/BEAM-12515: Python PreCommit flaking 
in PipelineOptionsTest.test_display_data (created 2021-06-18)
https://issues.apache.org/jira/browse/BEAM-12322: 
FnApiRunnerTestWithGrpcAndMultiWorkers flaky (py precommit) (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12309: 
PubSubIntegrationTest.test_streaming_data_only flake (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12307: 
PubSubBigQueryIT.test_file_loads flake (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12303: Flake in 
PubSubIntegrationTest.test_streaming_with_attributes (created 2021-05-06)
https://issues.apache.org/jira/browse/BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (created 2021-05-05)
https://issues.apache.org/jira/browse/BEAM-12200: 
SamzaStoreStateInternalsTest is flaky (created 2021-04-20)
https://issues.apache.org/jira/browse/BEAM-12163: Python GHA PreCommits 
flake with grpc.FutureTimeoutError on SDK harness startup (created 2021-04-13)
https://issues.apache.org/jira/browse/BEAM-12061: beam_PostCommit_SQL 
failing on KafkaTableProviderIT.testFakeNested (created 2021-03-27)
https://issues.apache.org/jira/browse/BEAM-12019: 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky (created 2021-03-18)
https://issues.apache.org/jira/browse/BEAM-11792: Python precommit failed 
(flaked?) installing package  (created 2021-02-10)
https://issues.apache.org/jira/browse/BEAM-11666: 
apache_beam.runners.interactive.recording_manager_test.RecordingManagerTest.test_basic_execution
 is flaky (created 2021-01-20)
https://issues.apache.org/jira/browse/BEAM-11661: hdfsIntegrationTest 
flake: network not found (py38 postcommit) (created 2021-01-19)
https://issues.apache.org/jira/browse/BEAM-11645: beam_PostCommit_XVR_Flink 
failing (created 2021-01-15)
https://issues.apache.org/jira/browse/BEAM-11541: 
testTeardownCalledAfterExceptionInProcessElement flakes on direct runner. 
(created 2020-12-30)
https://issues.apache.org/jira/browse/BEAM-10968: flaky test: 
org.apache.beam.sdk.metrics.MetricsTest$AttemptedMetricTests.testAttemptedDistributionMetrics
 (created 2020-09-25)
https://issues.apache.org/jira/browse/BEAM-10955: Flink Java Runner test 
flake: Could not find Flink job  (created 2020-09-23)
https://issues.apache.org/jira/browse/BEAM-10866: 
PortableRunnerTestWithSubprocesses.test_register_finalizations flaky on macOS 
(created 2020-09-09)
https://issues.apache.org/jira/browse/BEAM-10485: Failure / flake: 
ElasticsearchIOTest > testWriteWithIndexFn (created 2020-07-14)
https://issues.apache.org/jira/browse/BEAM-9649: 
beam_python_mongoio_load_test started failing due to mismatched results 
(created 2020-03-31)
https://issues.apache.org/jira/browse/BEAM-9232: 
BigQueryWriteIntegrationTests is flaky coercing to Unicode (created 2020-01-31)
https://issues.apache.org/jira/browse/BEAM-8101: Flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful for 
Direct, Spark, Flink (created 2019-08-27)
https://issues.apache.org/jira/browse/BEAM-8035: 
WatchTest.testMultiplePollsWithManyResults flake: Outputs must be in timestamp 
order (sickbayed) (created 2019-08-22)
https://issues.apache.org/jira/browse/BEAM-7992: Unhandled type_constraint 
in 
apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
 (created 2019-08-16)
https://issues.apache.org/jira/browse/BEAM-7827: 
MetricsTest$AttemptedMetricTests.testAllAttemptedMetrics is flaky on 
DirectRunner (created 2019-07-26)
https://issues.apache.org/jira/browse/BEAM-7752: Java Validates 
DirectRunner: testTeardownCalledAfterExceptionInFinishBundleStateful flaky 
(created 2019-07-16)
https://issues.apache.org/jira/browse/BEAM-6804: [beam_PostCommit_Java] 
[PubsubReadIT.testReadPublicData] Timeout waiting on Sub (created 2019-03-11)
https://issues.apache.org/jira/browse/BEAM-5286: 
[beam_PostCommit_Java_GradleBuild][org.apache.beam.examples.subprocess.ExampleEchoPipelineTest.testExampleEchoPipeline][Flake]
 .sh script: text file busy. (created 2018-09-01)
https://issues.apache.org/jira/browse/BEAM-5172: 
org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky (created 
2018-08-20)


P1 issues report (42)

2021-07-07 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20priority%20%3D%20P1%20AND%20(labels%20is%20EMPTY%20OR%20labels%20!%3D%20flake)).

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

https://issues.apache.org/jira/browse/BEAM-12583: 
beam_PreCommit_Java_Phrase fails (created 2021-07-07)
https://issues.apache.org/jira/browse/BEAM-12525: SDF BoundedSource seems 
to execute significantly slower than 'normal' BoundedSource (created 2021-06-22)
https://issues.apache.org/jira/browse/BEAM-12500: Dataflow SocketException 
(SSLException) error while trying to send message from Cloud Pub/Sub to 
BigQuery (created 2021-06-16)
https://issues.apache.org/jira/browse/BEAM-12484: JdbcIO date conversion is 
sensitive to OS (created 2021-06-14)
https://issues.apache.org/jira/browse/BEAM-12467: 
java.io.InvalidClassException With Flink Kafka (created 2021-06-09)
https://issues.apache.org/jira/browse/BEAM-12436: 
[beam_PostCommit_Go_VR_flink| beam_PostCommit_Go_VR_spark] 
[:sdks:go:test:flinkValidatesRunner] Failure summary (created 2021-06-01)
https://issues.apache.org/jira/browse/BEAM-12422: Vendored gRPC 1.36.0 is 
using a log4j version with security issues (created 2021-05-28)
https://issues.apache.org/jira/browse/BEAM-12396: 
beam_PostCommit_XVR_Direct failed (flaked?) (created 2021-05-24)
https://issues.apache.org/jira/browse/BEAM-12389: 
beam_PostCommit_XVR_Dataflow flaky: Expand method not found (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12387: beam_PostCommit_Python* 
timing out (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12386: 
beam_PostCommit_Py_VR_Dataflow(_V2) failing metrics tests (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12380: Go SDK Kafka IO Transform 
implemented via XLang (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12374: Spark postcommit failing 
ResumeFromCheckpointStreamingTest (created 2021-05-20)
https://issues.apache.org/jira/browse/BEAM-12320: 
PubsubTableProviderIT.testSQLSelectsArrayAttributes[0] failing in SQL 
PostCommit (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12310: 
beam_PostCommit_Java_DataflowV2 failing (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12279: Implement 
destination-dependent sharding in FileIO.writeDynamic (created 2021-05-04)
https://issues.apache.org/jira/browse/BEAM-12256: 
PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode some 
Avro logical types (created 2021-04-29)
https://issues.apache.org/jira/browse/BEAM-11959: Python Beam SDK Harness 
hangs when installing pip packages (created 2021-03-11)
https://issues.apache.org/jira/browse/BEAM-11906: No trigger early 
repeatedly for session windows (created 2021-03-01)
https://issues.apache.org/jira/browse/BEAM-11875: XmlIO.Read does not 
handle XML encoding per spec (created 2021-02-26)
https://issues.apache.org/jira/browse/BEAM-11828: JmsIO is not 
acknowledging messages correctly (created 2021-02-17)
https://issues.apache.org/jira/browse/BEAM-11755: Cross-language 
consistency (RequiresStableInputs) is quietly broken (at least on portable 
flink runner) (created 2021-02-05)
https://issues.apache.org/jira/browse/BEAM-11578: `dataflow_metrics` 
(python) fails with TypeError (when int overflowing?) (created 2021-01-06)
https://issues.apache.org/jira/browse/BEAM-11434: Expose Spanner 
admin/batch clients in Spanner Accessor (created 2020-12-10)
https://issues.apache.org/jira/browse/BEAM-11148: Kafka 
commitOffsetsInFinalize OOM on Flink (created 2020-10-28)
https://issues.apache.org/jira/browse/BEAM-11017: Timer with dataflow 
runner can be set multiple times (dataflow runner) (created 2020-10-05)
https://issues.apache.org/jira/browse/BEAM-10670: Make non-portable 
Splittable DoFn the only option when executing Java "Read" transforms (created 
2020-08-10)
https://issues.apache.org/jira/browse/BEAM-10617: python 
CombineGlobally().with_fanout() cause duplicate combine results for sliding 
windows (created 2020-07-31)
https://issues.apache.org/jira/browse/BEAM-10569: SpannerIO tests don't 
actually assert anything. (created 2020-07-23)
https://issues.apache.org/jira/browse/BEAM-10529: Kafka XLang fails for 
"empty" key/values (created 2020-07-18)
https://issues.apache.org/jira/browse/BEAM-10288: Quickstart documents are 
out of date (created 2020-06-19)
https://issues.apache.org/jira/browse/BEAM-10244: Populate requirements 
cache fails on poetry-based packages (created 2020-06-11)
https://issues.apache.org/jira/browse/BEAM-10100: FileIO writeDynamic with 
AvroIO.sink not writing all data (created 2020-05-27)

Re: [VOTE] Release 2.31.0, release candidate #1

2021-07-07 Thread Kenneth Knowles
+1. Ran a couple more validations of configurations that aren't in the
scripts, etc.

On Fri, Jul 2, 2021 at 6:51 PM Ahmet Altay  wrote:

>
>
> On Fri, Jul 2, 2021 at 9:57 AM Andrew Pilloud  wrote:
>
>> Thanks for noticing that! The key was updated prior to signature. I need
>> a PMC member's help to copy the new keys to the release folder:
>>
>
> Done.
>
>
>>
>> svn cp https://dist.apache.org/repos/dist/dev/beam/KEYS
>> https://dist.apache.org/repos/dist/
>> release/beam/KEYS
>>
>> On Fri, Jul 2, 2021 at 9:27 AM Robert Bradshaw 
>> wrote:
>>
>>> The release artifacts are all signed with an expired key. Other than
>>> that it looks good.
>>>
>>> On Wed, Jun 30, 2021 at 6:48 PM Ahmet Altay  wrote:
>>> >
>>> > +1 (binding) I ran python quick starts with directrunner.
>>> >
>>> > Thank you!
>>> > Ahmet
>>> >
>>> > On Wed, Jun 30, 2021 at 4:48 PM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>> >>
>>> >> +1 (binding)
>>> >>
>>> >> Ran some Java and multi-language validations.
>>> >>
>>> >> Thanks,
>>> >> Cham
>>> >>
>>> >> On Tue, Jun 29, 2021 at 12:32 PM Andrew Pilloud 
>>> wrote:
>>> >>>
>>> >>> This is the Beam release, but Dataflow container images should be
>>> available now if you want to test with that.
>>> >>>
>>> >>> On Mon, Jun 28, 2021 at 7:32 PM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>> 
>>>  Looks like Dataflow container images have not been released yet.
>>> 
>>>  On Mon, Jun 28, 2021 at 5:09 PM Andrew Pilloud 
>>> wrote:
>>> >
>>> > Hi everyone,
>>> > Please review and vote on the release candidate #1 for the version
>>> 2.31.0, as follows:
>>> > [ ] +1, Approve the release
>>> > [ ] -1, Do not approve the release (please provide specific
>>> comments)
>>> >
>>> >
>>> > Reviewers are encouraged to test their own use cases with the
>>> release candidate, and vote +1 if no issues are found.
>>> >
>>> > The complete staging area is available for your review, which
>>> includes:
>>> > * JIRA release notes [1],
>>> > * the official Apache source release to be deployed to
>>> dist.apache.org [2], which is signed with the key with fingerprint
>>> 9F8AE3D4 [3],
>>> > * all artifacts to be deployed to the Maven Central Repository [4],
>>> > * source code tag "v2.31.0-RC1" [5],
>>> > * website pull request listing the release [6], publishing the API
>>> reference manual [7], and the blog post [6].
>>> > * Java artifacts were built with Gradle 6.8.3 and OpenJDK/Oracle
>>> JDK 1.8.0_232.
>>> > * Python artifacts are deployed along with the source release to
>>> the dist.apache.org [2] and to PyPI [8].
>>> > * Validation sheet with a tab for 2.31.0 release to help with
>>> validation [9].
>>> > * Docker images published to Docker Hub [10].
>>> >
>>> > The vote will be open for at least 72 hours. It is adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>> >
>>> > For guidelines on how to try the release in your projects, check
>>> out our blog post at https://beam.apache.org/blog/validate-beam-release/
>>> .
>>> >
>>> > Thanks,
>>> > Release Manager
>>> >
>>> > [1]
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349991
>>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.31.0/
>>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> > [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1175/
>>> > [5] https://github.com/apache/beam/tree/v2.31.0-RC1
>>> > [6] https://github.com/apache/beam/pull/15068
>>> > [7] https://github.com/apache/beam-site/pull/615
>>> > [8] https://pypi.org/project/apache-beam/2.31.0rc1/
>>> > [9]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1518714176
>>> > [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>>
>>


Re: [IDEA] Privacy (and security) in Apache Beam

2021-07-07 Thread Kenneth Knowles
This is a whole area of research that is really cool. Something I read a
while ago and think about sometimes: Explaining output in Modern Data
Analytics (writeup here
https://blog.acolyer.org/2017/02/01/explaining-outputs-in-modern-data-analytics/
)

Kenn

On Wed, Jul 7, 2021 at 8:18 AM Reuven Lax  wrote:

> One interesting area to explore is lineage - can every output record from
> a pipeline be tracked back to its source input record. This gets
> interesting with aggregations, where multiple input records combine to
> create a single output record.
>
> On Wed, Jul 7, 2021 at 6:11 AM Guillermo Rodríguez Cano 
> wrote:
>
>> Hello!
>>
>> I am wondering if there is anyone interested in exploring the topic of
>> privacy (and potentially security) in the Apache Beam unified programming
>> model.
>>
>> I have been a user of Apache Beam mostly via Tensorflow Transform but
>> also directly and followed its evolution and development early on.
>> However, given my research background, I have always wondered about the
>> topics of privacy and security when processing large amounts of data with,
>> for example, Apache Beam.
>>
>> There is some work on the topic of differential privacy and how to
>> achieve that practically.
>> But I would like to explore and go beyond as I think the problem is much
>> broader and requires a wider analysis to have it addressed in different
>> angles or directions.
>>
>> Is there anyone in this list interested to discuss the topic and explore
>> ideas? I would be happy to coordinate some special interest group if that
>> makes it easier.
>> Or maybe you know someone who would be interested or point me to where to
>> head :)
>>
>> /Guillermo
>>
>


Re: Help: Apache Beam Session Window with limit on number of events and time elapsed from window start

2021-07-07 Thread Kenneth Knowles
Hi Chandan,

I am moving this thread to u...@beam.apache.org. I think that is the best
place to discuss.

Kenn

On Wed, Jul 7, 2021 at 9:32 AM Chandan Bhattad 
wrote:

> Hi Team,
>
> Hope you are doing well.
>
> I have a use case around session windowing with some customizations.
>
> We need to create user sessions based on *any* of the 3 conditions
> below
>
> 1. Session Window of 30 minutes (meaning, 30 minutes of inactivity i.e. no
> event for 30 minutes for a user)
> 2. Number of events in the session window reaches 20,000
> 3. 4 hours have elapsed since window start
>
> Below is what I have tried.
>
> beam.WindowInto(window.Sessions(session_timeout_seconds),
> trigger=trigger.Repeatedly(
> trigger.AfterAny(
> trigger.AfterCount(2),
> trigger.DefaultTrigger(),
> TriggerWhenWindowStartPassesXHours(hours=0.2)
> )
> ),
> timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW,
> accumulation_mode=trigger.AccumulationMode.DISCARDING
> )
>
>
> # Custom Trigger Implementation
> from apache_beam.transforms.trigger import DefaultTrigger
> from apache_beam.utils.timestamp import Timestamp
>
>
> class TriggerWhenWindowStartPassesXHours(DefaultTrigger):
>
> def __init__(self, hours=4):
> super(TriggerWhenWindowStartPassesXHours, self).__init__()
> self.hours = hours
>
> def __repr__(self):
> return 'TriggerWhenWindowStartPassesXHours()'
>
> def should_fire(self, time_domain, watermark, window, context):
> should_fire = (Timestamp.now() - window.start).micros >= 36 * 10 ** 8 * self.hours
> return should_fire
>
> @staticmethod
> def from_runner_api(proto, context):
> return TriggerWhenWindowStartPassesXHours()
>
> The above works well, but there is an issue. Whenever Trigger No. 3 above
> fires -- it does not create a new session window, but the same window is
> continued.
> What happens due to this is, trigger No. 3 would keep on firing on every
> subsequent event after 4 hours since window start, since the should_fire condition is:
>
> should_fire = (Timestamp.now() - window.start).micros >= 36 * 10 ** 8 * self.hours
>
> and since window.start is never updated after the first time trigger is
> fired - it will fire for every subsequent event after the first trigger.
>
> I have also posted this on stackoverflow:
> https://stackoverflow.com/questions/68250618/apache-beam-session-window-with-limit-on-number-of-events
>
> I would be very grateful for any help as to how to achieve this.
> Thanks a lot in advance.
>
> Regards,
> Chandan
>


Help: Apache Beam Session Window with limit on number of events and time elapsed from window start

2021-07-07 Thread Chandan Bhattad
Hi Team,

Hope you are doing well.

I have a use case around session windowing with some customizations.

We need to create user sessions based on *any* of the 3 conditions
below

1. Session Window of 30 minutes (meaning, 30 minutes of inactivity i.e. no
event for 30 minutes for a user)
2. Number of events in the session window reaches 20,000
3. 4 hours have elapsed since window start

Below is what I have tried.

beam.WindowInto(
    window.Sessions(session_timeout_seconds),
    trigger=trigger.Repeatedly(
        trigger.AfterAny(
            trigger.AfterCount(2),
            trigger.DefaultTrigger(),
            TriggerWhenWindowStartPassesXHours(hours=0.2)
        )
    ),
    timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_EOW,
    accumulation_mode=trigger.AccumulationMode.DISCARDING
)


# Custom Trigger Implementation
from apache_beam.transforms.trigger import DefaultTrigger
from apache_beam.utils.timestamp import Timestamp


class TriggerWhenWindowStartPassesXHours(DefaultTrigger):

    def __init__(self, hours=4):
        super(TriggerWhenWindowStartPassesXHours, self).__init__()
        self.hours = hours

    def __repr__(self):
        return 'TriggerWhenWindowStartPassesXHours()'

    def should_fire(self, time_domain, watermark, window, context):
        # 36 * 10 ** 8 micros == 3.6e9 micros == one hour
        should_fire = (Timestamp.now() - window.start).micros >= 36 * 10 ** 8 * self.hours
        return should_fire

    @staticmethod
    def from_runner_api(proto, context):
        return TriggerWhenWindowStartPassesXHours()

The above works well, but there is an issue. Whenever Trigger No. 3 above
fires -- it does not create a new session window, but the same window is
continued.
What happens due to this is, trigger No. 3 would keep on firing on every
subsequent event after 4 hours since window start, since the should_fire condition is:

should_fire = (Timestamp.now() - window.start).micros >= 36 * 10 ** 8 * self.hours

and since window.start is never updated after the first time trigger is
fired - it will fire for every subsequent event after the first trigger.
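The refiring behaviour described above can be modeled without any Beam APIs. The sketch below is plain Python with illustrative names only (it is not a working Beam trigger); it shows why comparing against a fixed window start keeps firing once the threshold passes, and one possible workaround of anchoring the check to the last firing time instead:

```python
HOUR_MICROS = 3600 * 1000 * 1000  # microseconds in one hour

class ElapsedTimeTrigger:
    """Plain-Python model (not Beam's trigger API) of the elapsed-time check."""

    def __init__(self, hours, window_start_micros):
        self.interval = int(hours * HOUR_MICROS)
        self.window_start = window_start_micros
        self.last_fire = None  # micros timestamp of the last firing, if any

    def should_fire(self, now_micros):
        # Once we have fired, anchor to the last firing instead of the fixed
        # window start; this stops the fire-on-every-element loop.
        anchor = self.window_start if self.last_fire is None else self.last_fire
        if now_micros - anchor >= self.interval:
            self.last_fire = now_micros
            return True
        return False

trig = ElapsedTimeTrigger(hours=4, window_start_micros=0)
print(trig.should_fire(4 * HOUR_MICROS))      # True: 4 hours elapsed
print(trig.should_fire(4 * HOUR_MICROS + 1))  # False: no immediate refire
```

A real fix inside Beam would need the trigger to persist the last-fire timestamp in trigger state rather than on the trigger object, since trigger instances are not guaranteed to live for the lifetime of a window.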

I have also posted this on stackoverflow:
https://stackoverflow.com/questions/68250618/apache-beam-session-window-with-limit-on-number-of-events

I would be very grateful for any help as to how to achieve this.
Thanks a lot in advance.

Regards,
Chandan


Documentation error

2021-07-07 Thread Igor Mossinato
Hello everybody,

Thanks for the amazing work! I'm loving it!

But the Python code in item 2.1.1, "Setting PipelineOptions from
command-line arguments", is wrong.

According to the Google docs, it should be:


from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(flags=argv)
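For completeness, the Beam docs usually pair `PipelineOptions` with `argparse`: application-specific flags are parsed first, and the leftover arguments are handed to `PipelineOptions(flags=pipeline_args)`. A minimal stdlib-only sketch of that split (the `--input` flag name and the argv values are examples, not from the docs):

```python
import argparse

# Simulated command line: one application flag plus one pipeline flag.
argv = ['--input', 'gs://some-bucket/input.txt', '--runner=DirectRunner']

parser = argparse.ArgumentParser()
parser.add_argument('--input')  # application-specific flag (example name)
app_args, pipeline_args = parser.parse_known_args(argv)

print(app_args.input)   # gs://some-bucket/input.txt
print(pipeline_args)    # ['--runner=DirectRunner']
# In a real pipeline the leftover flags would go to:
#   options = PipelineOptions(flags=pipeline_args)
```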


Regards!


Re: Spotbugs issue on project I did not modify

2021-07-07 Thread Matthew Ouyang
Thank you for the confirmation Alexey.  It worked out.

On Wed, Jul 7, 2021 at 6:03 AM Alexey Romanenko 
wrote:

> Yes, rebasing against HEAD is almost always a good idea. I did run locally
> “./gradlew :sdks:java:harness:check” (which includes “spotbugs" check)
> against current HEAD and there is no issue with spotbugsMain task.
>
> —
> Alexey
>
> On 7 Jul 2021, at 04:28, Matthew Ouyang  wrote:
>
> I get the following spotbugs failure after running spotlessApply on a
> project that I did not modify and cannot reproduce in a Windows 10
> environment.  I am also unable to see the particular failure being cited
> because a file local to the machine the build ran on is being referenced.
>
> What confuses me is that the failure occurred in the sdks/java/harness 
> project even though
> I didn't change it.  I also didn't find build failures for commits made
> against that project.  To be fair, my branch is based on a commit that is 3
> weeks old.  Would rebasing to HEAD master make a difference?
>
> PR: https://github.com/apache/beam/pull/15070
> Jenkins:
> https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3779/console
>
> *03:03:22* FAILURE: Build failed with an exception.
>
> *03:03:22*
> *03:03:22* * What went wrong:
> *03:03:22* Execution failed for task ':sdks:java:harness:spotbugsMain'.
> *03:03:22* > A failure occurred while executing
> com.github.spotbugs.snom.internal.SpotBugsRunnerForWorker$SpotBugsExecutor
> *03:03:22* > 1 SpotBugs violations were found. See the report at:
> file:///home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/harness/build/reports/spotbugs/main.xml
>
>
>
>
>


[IDEA] Privacy (and security) in Apache Beam

2021-07-07 Thread Guillermo Rodríguez Cano
Hello!

I am wondering if there is anyone interested in exploring the topic of
privacy (and potentially security) in the Apache Beam unified programming
model.

I have been a user of Apache Beam mostly via Tensorflow Transform but also
directly and followed its evolution and development early on.
However, given my research background, I have always wondered about the
topics of privacy and security when processing large amounts of data with,
for example, Apache Beam.

There is some work on the topic of differential privacy and how to achieve
that practically.
But I would like to explore and go beyond as I think the problem is much
broader and requires a wider analysis to have it addressed in different
angles or directions.

Is there anyone in this list interested to discuss the topic and explore
ideas? I would be happy to coordinate some special interest group if that
makes it easier.
Or maybe you know someone who would be interested or point me to where to
head :)

/Guillermo


Re: JavaPrecommit fails

2021-07-07 Thread Alexey Romanenko
For the reference: https://issues.apache.org/jira/browse/BEAM-12583

> On 6 Jul 2021, at 18:31, Alexey Romanenko  wrote:
> 
> Yes, it fails constantly for me too. I briefly did a "git bisect” and 
> a60aebaac9c90d393ca1e2a7445d45222d908541 is the first bad commit. It’s quite 
> strange that this test passed initially on Jenkins, but it seems to me that 
> PR #14805 [1] was tested without taking into account changes from another 
> one, #14804 [2], in ShortIdMap.java where it fails now.
> 
> @Alex Amato 
> Could you take a look, please?
> 
> —
> Alexey
> 
> [1] https://github.com/apache/beam/pull/14805
> [2] https://github.com/apache/beam/pull/14804
> 
> 
>> On 6 Jul 2021, at 09:51, Jan Lukavský  wrote:
>> 
>> Looks to me to be failing consistently. At least it appears so both on PRs and 
>> when tested locally on the master branch.
>> 
>> On 7/5/21 1:34 PM, Alexey Romanenko wrote:
>>> Hello,
>>> 
>>> JavaPrecommit fails because of 
>>> “org.apache.beam.fn.harness.control.HarnessMonitoringInfosInstructionHandlerTest.testReturnsProcessWideMonitoringInfos”
>>>  test.
>>> Is it a regression or just flaky?
>>> 
>>> —
>>> Alexey
>>> 
>>> 
> 



Re: Spotbugs issue on project I did not modify

2021-07-07 Thread Alexey Romanenko
Yes, rebasing against HEAD is almost always a good idea. I did run locally 
“./gradlew :sdks:java:harness:check” (which includes “spotbugs" check) against 
current HEAD and there is no issue with spotbugsMain task. 

—
Alexey

> On 7 Jul 2021, at 04:28, Matthew Ouyang  wrote:
> 
> I get the following spotbugs failure after running spotlessApply on a project 
> that I did not modify and cannot reproduce in a Windows 10 environment.  I am 
> also unable to see the particular failure being cited because a file local to 
> the machine the build ran on is being referenced.
> 
> What confuses me is that the failure occurred in the sdks/java/harness 
> project even though I didn't change it.  I also didn't find build failures for 
> commits made against that project.  To be fair, my branch is based on a 
> commit that is 3 weeks old.  Would rebasing to HEAD master make a difference?
> 
> PR: https://github.com/apache/beam/pull/15070 
> 
> Jenkins: 
> https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3779/console 
> 
> 
> 03:03:22 FAILURE: Build failed with an exception.
> 03:03:22 
> 03:03:22 * What went wrong:
> 03:03:22 Execution failed for task ':sdks:java:harness:spotbugsMain'.
> 03:03:22 > A failure occurred while executing 
> com.github.spotbugs.snom.internal.SpotBugsRunnerForWorker$SpotBugsExecutor
> 03:03:22 > 1 SpotBugs violations were found. See the report at: 
> file:///home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/harness/build/reports/spotbugs/main.xml



Re: FileIO with custom sharding function

2021-07-07 Thread Jozef Vilcek
On Sat, Jul 3, 2021 at 7:30 PM Reuven Lax  wrote:

>
>
> On Sat, Jul 3, 2021 at 1:02 AM Jozef Vilcek  wrote:
>
>> I don't think this has anything to do with external shuffle services.
>>
>> Arbitrarily recomputing data is fundamentally incompatible with Beam,
>> since Beam does not restrict transforms to being deterministic. The Spark
>> runner works (at least it did last I checked) by checkpointing the RDD.
>> Spark will not recompute the DAG past a checkpoint, so this creates stable
>> input to a transform. This adds some cost to the Spark runner, but makes
>> things correct. You should not have sharding problems due to replay unless
>> there is a bug in the current Spark runner.
>>
>> Beam does not restrict non-determinism, true. If users do add it, then
>> they can work around it should they need to address any side effects. But
>> should non-determinism be deliberately added by Beam core? Probably can if
>> runners can 100% deal with that effectively. To the point of RDD
>> `checkpoint`, afaik SparkRunner does use `cache`, not checkpoint. Also, I
>> thought cache is invoked on fork points in DAG. I have just a join and map
>> and in such cases I thought data is always served out by shuffle service.
>> Am I mistaken?
>>
>
> Non determinism is already there  in core Beam features. If any sort of
> trigger is used anywhere in the pipeline, the result is non deterministic.
> We've also found that users tend not to know when their ParDos are
> non-deterministic, so telling users to work around it tends not to work.
>
> The Spark runner definitely used to use checkpoint in this case.
>

It is beyond my knowledge level of Beam how triggers introduce
non-determinism into the code processing in batch mode. So I leave that
out. Reuven, do you mind pointing me to the Spark runner code which is
supposed to handle this by using checkpoint? I can not find it myself.



>
>>
>> On Fri, Jul 2, 2021 at 8:23 PM Reuven Lax  wrote:
>>
>>>
>>>
>>> On Tue, Jun 29, 2021 at 1:03 AM Jozef Vilcek 
>>> wrote:
>>>

 How will @RequiresStableInput prevent this situation when running batch
> use case?
>

 So this is handled in combination of @RequiresStableInput and output
 file finalization. @RequiresStableInput (or Reshuffle for most runners)
 makes sure that the input provided for the write stage does not get
 recomputed in the presence of failures. Output finalization makes sure that
 we only finalize one run of each bundle and discard the rest.

 So this I believe relies on having robust external service to hold
 shuffle data and serve it out when needed so pipeline does not need to
 recompute it via non-deterministic function. In Spark however, shuffle
> service can not be (if I am not mistaken) deployed in this fashion (HA +
 replicated shuffle data). Therefore, if instance of shuffle service holding
 a portion of shuffle data fails, spark recovers it by recomputing parts of
 a DAG from source to recover lost shuffle results. I am not sure what Beam
 can do here to prevent it or make it stable? Will @RequiresStableInput work
 as expected? ... Note that this, I believe, concerns only the batch.

>>>
>>> I don't think this has anything to do with external shuffle services.
>>>
>>> Arbitrarily recomputing data is fundamentally incompatible with Beam,
>>> since Beam does not restrict transforms to being deterministic. The Spark
>>> runner works (at least it did last I checked) by checkpointing the RDD.
>>> Spark will not recompute the DAG past a checkpoint, so this creates stable
>>> input to a transform. This adds some cost to the Spark runner, but makes
>>> things correct. You should not have sharding problems due to replay unless
>>> there is a bug in the current Spark runner.
>>>
>>>

 Also, when withNumShards() truly have to be used, round robin
 assignment of elements to shards sounds like the optimal solution (at least
 for the vast majority of pipelines)

 I agree. But random seed is my problem here with respect to the
 situation mentioned above.

 Right, I would like to know if there are more true use-cases before
 adding this. This essentially allows users to map elements to exact output
 shards which could change the characteristics of pipelines in very
 significant ways without users being aware of it. For example, this could
 result in extremely imbalanced workloads for any downstream processors. If
 it's just Hive I would rather work around it (even with a perf penalty for
 that case).

 User would have to explicitly ask FileIO to use specific sharding.
 Documentation can educate about the tradeoffs. But I am open to workaround
 alternatives.


 On Mon, Jun 28, 2021 at 5:50 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

>
>
> On Mon, Jun 28, 2021 at 2:47 AM Jozef Vilcek 
> wrote:
>
>> Hi Cham, thanks for 

Re: FileIO with custom sharding function

2021-07-07 Thread Jozef Vilcek
On Sat, Jul 3, 2021 at 12:55 PM Jan Lukavský  wrote:

>
> I don't think this has anything to do with external shuffle services.
>
>
> Sorry, for stepping into this discussion again, but I don't think this
> statement is 100% correct. What Spark's checkpoint does is that it saves
> intermediate data (prior to shuffle) to external storage so that it can be
> made available in case it is needed. Now, what is the purpose of an
> external shuffle service? Persist data externally to make it available when
> it is needed. In that sense, enforcing a checkpoint prior to each shuffle
> effectively means, we force Spark to shuffle data twice (ok,
> once-and-a-half-times, the external checkpoint is likely not to be read) -
> once using its own internal shuffle and second time the external
> checkpoint. That is not going to be efficient. And if spark could switch
> off its internal shuffle and use the external checkpoint for the same
> purpose, then the checkpoint would play role of external shuffle service,
> which is why this whole discussion has to do something with external
> shuffle services.
>
> To move this discussion forward, I think the solution could be that the
> SparkRunner overrides the default FileIO transform and uses a
> deterministic sharding function. @Jozef, would that work for you? It would
> mean that the same would probably have to be done (sooner or later) for
> Flink batch, and probably for other runners. Maybe there could be a
> deterministic override ready-made in runners-core. Or, actually, maybe the
> easier way would be the other way around: Dataflow would use the
> non-deterministic version, while other runners could use the (more
> conservative, yet less performant) version.
>
> WDYT?
>
Yes, that could work for the problem of Beam on Spark producing
inconsistent file output when shuffle data is lost. I would still need to
address the "bucketing problem", e.g. wanting data from the same user_id and
hour to end up in the same file. It feels OK to tackle those separately.
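For the bucketing side, a deterministic sharding function could be as simple as hashing the grouping key. A minimal sketch in plain Java (not tied to any actual FileIO hook; the `assignShard` name and signature are hypothetical, only the hashing idea matters):

```java
import java.util.Objects;

public class DeterministicSharding {
    // Assign an element to a shard based on (userId, hour) so that all
    // records for the same user and hour land in the same output file,
    // and the assignment survives recomputation after a failure.
    static int assignShard(String userId, int hour, int numShards) {
        // floorMod guards against negative hash codes.
        return Math.floorMod(Objects.hash(userId, hour), numShards);
    }

    public static void main(String[] args) {
        int shards = 10;
        int a = assignShard("user-42", 7, shards);
        int b = assignShard("user-42", 7, shards);
        // Deterministic: recomputing yields the same shard.
        if (a != b) throw new AssertionError("sharding not deterministic");
        System.out.println("user-42/hour=7 -> shard " + a);
    }
}
```

Because the assignment depends only on the element's key, a replayed bundle produces the same shard layout, which is exactly the property lost with a per-attempt random seed.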


>  Jan
> On 7/2/21 8:23 PM, Reuven Lax wrote:
>
>
>
> On Tue, Jun 29, 2021 at 1:03 AM Jozef Vilcek 
> wrote:
>
>>
>> How will @RequiresStableInput prevent this situation when running batch
>>> use case?
>>>
>>
>> So this is handled by a combination of @RequiresStableInput and output
>> file finalization. @RequiresStableInput (or Reshuffle for most runners)
>> makes sure that the input provided to the write stage does not get
>> recomputed in the presence of failures. Output finalization makes sure
>> that we only finalize one run of each bundle and discard the rest.
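Beam's actual finalization machinery is more involved, but the "only one run of each bundle wins" idea can be sketched with an atomic rename: every attempt writes to its own temp file, and whichever attempt renames first becomes the final output (names and structure here are illustrative, not Beam's API):

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class Finalization {
    // Each bundle attempt writes to its own temp file; finalization moves
    // exactly one attempt to the final name, so replayed attempts are
    // silently discarded instead of producing duplicate output.
    static void finalizeBundle(Path attempt, Path finalName) throws IOException {
        try {
            Files.move(attempt, finalName);
        } catch (FileAlreadyExistsException e) {
            // Another attempt already won; discard this one.
            Files.deleteIfExists(attempt);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("finalize-demo");
        Path finalFile = dir.resolve("shard-00000");
        Path attempt1 = Files.writeString(dir.resolve(".tmp-attempt-1"), "data");
        Path attempt2 = Files.writeString(dir.resolve(".tmp-attempt-2"), "data");
        finalizeBundle(attempt1, finalFile);
        finalizeBundle(attempt2, finalFile); // replayed bundle, discarded
        System.out.println("final exists: " + Files.exists(finalFile));
    }
}
```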
>>
>> So this, I believe, relies on having a robust external service to hold
>> shuffle data and serve it out when needed, so the pipeline does not need
>> to recompute it via a non-deterministic function. In Spark, however, the
>> shuffle service cannot (if I am not mistaken) be deployed in this fashion
>> (HA + replicated shuffle data). Therefore, if the instance of the shuffle
>> service holding a portion of the shuffle data fails, Spark recovers by
>> recomputing parts of the DAG from the source to regenerate the lost
>> shuffle results. I am not sure what Beam can do here to prevent this or
>> make it stable. Will @RequiresStableInput work as expected? ... Note that
>> this, I believe, concerns only batch.
>>
>
> I don't think this has anything to do with external shuffle services.
>
> Arbitrarily recomputing data is fundamentally incompatible with Beam,
> since Beam does not restrict transforms to being deterministic. The Spark
> runner works (at least it did last I checked) by checkpointing the RDD.
> Spark will not recompute the DAG past a checkpoint, so this creates stable
> input to a transform. This adds some cost to the Spark runner, but makes
> things correct. You should not have sharding problems due to replay unless
> there is a bug in the current Spark runner.
>
>
>>
>> Also, when withNumShards() truly has to be used, round-robin assignment
>> of elements to shards sounds like the optimal solution (at least for
>> the vast majority of pipelines)
>>
>> I agree. But the random seed is my problem here, with respect to the
>> situation mentioned above.
>>
>> Right, I would like to know if there are more true use-cases before
>> adding this. This essentially allows users to map elements to exact output
>> shards which could change the characteristics of pipelines in very
>> significant ways without users being aware of it. For example, this could
>> result in extremely imbalanced workloads for any downstream processors. If
>> it's just Hive I would rather work around it (even with a perf penalty for
>> that case).
>>
>> The user would have to explicitly ask FileIO to use specific sharding.
>> Documentation can educate about the tradeoffs. But I am open to workaround
>> alternatives.
>>
>>
>> On Mon, Jun 28, 2021 at 5:50 PM Chamikara Jayalath 
>> wrote:
>>
>>>
>>>
>>> On Mon, Jun 28, 2021 at 2:47 AM Jozef Vilcek 
>>> wrote:
>>>
 Hi Cham, thanks for the feedback

 > Beam