Re: FileIO with custom sharding function

2021-06-17 Thread Jozef Vilcek
I would like this thread to stay focused on sharding FileIO only. Possible
change to the model is an interesting topic but of a much different scope.

Yes, I agree that sharding is mostly a physical rather than logical
property of the pipeline. That is why it feels more natural to distinguish
between those two on the API level.
As for handling sharding requirements by adding more sugar to dynamic
destinations + file naming, one has to keep in mind that the results of
dynamic writes can be observed in the form of KV pairs, i.e. the written
files per dynamic destination. Often we do a GBK to post-process files per
destination / logical group. If sharding were encoded there, it might
complicate things either downstream or inside the sugar part, which would
have to put the shard in and then take it out later.
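To illustrate the concern: if the shard index were folded into the dynamic-destination key, every downstream consumer grouping files per logical destination would have to strip it back out. A minimal sketch, where the `#` separator and both helpers are purely hypothetical:

```java
// Hypothetical sketch only: if sharding were encoded into the dynamic
// destination (e.g. "logs#3" for shard 3 of destination "logs"), code that
// post-processes files per logical destination would need to undo that
// encoding before grouping, roughly like this.
public class DestinationKeys {
    // Fold the shard into the destination key ("sugar" on the write side).
    static String withShard(String destination, int shard) {
        return destination + "#" + shard;
    }

    // Recover the logical destination for per-destination grouping (GBK).
    static String logicalDestination(String destinationWithShard) {
        int i = destinationWithShard.lastIndexOf('#');
        return i < 0 ? destinationWithShard : destinationWithShard.substring(0, i);
    }

    public static void main(String[] args) {
        String encoded = withShard("logs", 3);
        System.out.println(encoded);                     // logs#3
        System.out.println(logicalDestination(encoded)); // logs
    }
}
```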
From the user's perspective I do not see much difference. We would still
need the API to allow defining both behaviors; they would only be executed
differently by the implementation.
I do not see value in changing the FileIO (WriteFiles) logic to stop using
sharding and use dynamic destinations for both, given that the sharding
function is already there and in use.

To the point of external shuffle and non-deterministic user input:
Yes, users can create non-deterministic behaviors, but they are in control.
Here, Beam internally adds non-deterministic behavior and users cannot
opt out.
All works fine as long as the external shuffle service (depending on the
Runner) holds on to the data and hands it out on retries. However, if data
in the shuffle service is lost for some reason - e.g. disk failure, a node
breaks down - then the pipeline has to recover the data by recomputing the
necessary paths.
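For reference, the deterministic alternative raised later in this thread (hashing the key instead of drawing random shard numbers) only works if the hash is stable across JVMs, which Object.hashCode() does not guarantee. A minimal sketch using a JVM-stable checksum; this is illustrative only, not Beam's actual ShardingFunction API:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative sketch of deterministic shard assignment: hash the element's
// key with a checksum whose value is specified independently of the JVM
// (unlike Object.hashCode() or an enum's hashCode) and reduce it modulo the
// shard count. A retry then recomputes the same shard for the same key.
public class StableSharding {
    static int shardFor(String key, int numShards) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // getValue() is an unsigned 32-bit value in a long, so the
        // remainder is always non-negative.
        return (int) (crc.getValue() % numShards);
    }

    public static void main(String[] args) {
        // Same key, same shard - on any JVM, on any retry.
        System.out.println(shardFor("user-42", 10));
    }
}
```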

On Thu, Jun 17, 2021 at 7:36 PM Robert Bradshaw  wrote:

> Sharding is typically a physical rather than logical property of the
> pipeline, and I'm not convinced it makes sense to add it to Beam in
> general. One can already use keys and GBK/Stateful DoFns if some kind
> of logical grouping is needed, and adding constraints like this can
> prevent opportunities for optimizations (like dynamic sharding and
> fusion).
>
> That being said, file outputs are one area where it could make sense. I
> would expect that dynamic destinations could cover this use case, and a
> general FileNaming subclass could be provided to make this pattern
> easier (and possibly some syntactic sugar for auto-setting num shards
> to 0). (One downside of this approach is that one couldn't do dynamic
> destinations, and have each sharded with a distinct sharding function
> as well.)
>
> If this doesn't work, we could look into adding ShardingFunction as a
> publicly exposed parameter to FileIO. (I'm actually surprised it
> already exists.)
>
> On Thu, Jun 17, 2021 at 9:39 AM  wrote:
> >
> > Alright, but what is worth emphasizing is that we talk about batch
> workloads. The typical scenario is that the output is committed once the
> job finishes (e.g., by atomic rename of directory).
> >  Jan
> >
> > On 17. 6. 2021 at 17:59, Reuven Lax wrote:
> >
> > Yes - the problem is that Beam makes no guarantees of determinism
> anywhere in the system. User DoFns might be non deterministic, and have no
> way to know (we've discussed providing users with an @IsDeterministic
> annotation, however empirically users often think their functions are
> deterministic when they are in fact not). _Any_ sort of triggered
> aggregation (including watermark based) can always be non deterministic.
> >
> > Even if everything was deterministic, replaying everything is tricky.
> The output files already exist - should the system delete them and recreate
> them, or leave them as is and delete the temp files? Either decision could
> be problematic.
> >
> > On Wed, Jun 16, 2021 at 11:40 PM Jan Lukavský  wrote:
> >
> > Correct, by the external shuffle service I pretty much meant "offloading
> the contents of a shuffle phase out of the system". Looks like that is what
> the Spark's checkpoint does as well. On the other hand (if I understand the
> concept correctly), that implies some performance penalty - the data has to
> be moved to external distributed filesystem. Which then feels weird if we
> optimize code to avoid computing random numbers, but are okay with moving
> complete datasets back and forth. I think in this particular case making
> the Pipeline deterministic - idempotent to be precise - (within the limits,
> yes, hashCode of enum is not stable between JVMs) would seem more practical
> to me.
> >
> >  Jan
> >
> > On 6/17/21 7:09 AM, Reuven Lax wrote:
> >
> > I have some thoughts here, as Eugene Kirpichov and I spent a lot of time
> working through these semantics in the past.
> >
> > First - about the problem of duplicates:
> >
> > A "deterministic" sharding - e.g. hashCode based (though Java makes no
> guarantee that hashCode is stable across JVM instances, so this technique
> ends up not being stable) doesn't really help matters in Beam. Unlike other
> systems, Beam makes no assumptions that transforms are idempotent or
> deterministic.

Re: Aliasing Pub/Sub Lite IO in external repo

2021-06-17 Thread Tomo Suzuki
Hi Daniel,
(You helped me apply some change to this strange setup a few months back.
Thank you for working on rectifying the situation.)

I like that idea overall.

Question 1: How are you going to approach testing/CI?
The pull requests in the java-pubsublite repo do not trigger Beam repo's
CI. You want to deliver things to your customers after they are tested as
much as possible.


Question 2: in the code below, what is the purpose of keeping the
PubsubLiteIO in the Beam repo?

```
class PubsubLiteIO extends com.google.cloud.pubsublite.beam.PubsubLiteIO {}
```

Backward compatibility came to my mind, but I thought you may have more
reasons.


My memo:
The java-pubsublite repository has:
https://github.com/googleapis/java-pubsublite/blob/master/pubsublite-beam-io/src/main/java/com/google/cloud/pubsublite/beam/PubsubLiteIO.java
beam repo has:
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsublite/PubsubLiteIO.java
(and other files in the same directory)
google-cloud-pubsublite is not part of the Libraries BOM (yet) because of
its pre-1.0 status.


On Thu, Jun 17, 2021 at 5:07 PM Daniel Collins  wrote:

> I don't know that the cycle would cause a problem; wouldn't it override
> and cause it to use beam-sdks-java-core:2.30.0 (at least until beam goes to
> 3.X.X)?
>
> Something we can do if this is an issue is mark pubsublite-beam-io's dep
> on beam-sdks-java-core as 'provided'. But I'd prefer to avoid this and just
> let overriding fix it if that works.
>
> On Thu, Jun 17, 2021 at 4:15 PM Andrew Pilloud 
> wrote:
>
>> How do you plan to address the circular dependency? Won't this end up
>> with Beam depending on older versions of itself?
>>
>> beam-sdks-java-io-google-cloud-platform:2.30.0 ->
>> pubsublite-beam-io:0.16.0 -> beam-sdks-java-core:2.29.0
>>
>> On Thu, Jun 17, 2021 at 11:56 AM Daniel Collins 
>> wrote:
>>
>>> Hello beam developers,
>>>
>>> I'm the primary author of the Pub/Sub Lite I/O, and I'd like to get some
>>> feedback on a change to the model for hosting this I/O in beam. Our team
>>> has been frustrated by the fact that we have no way to release features or
>>> fixes for bugs to customers on time scales shorter than the 1-2 months of
>>> the beam release cycle, and that those fixes are necessarily coupled with a
>>> beam version upgrade. To work around this, I forked the I/O in beam to our
>>> own repo about 6 months ago and have been maintaining both copies in
>>> parallel.
>>>
>>> I'd like to retain our ability to quickly fix and improve the I/O while
>>> retaining end-user visibility within the beam repo. To do this, I'd like
>>> to remove all the implementation from the beam repo, and leave the I/O
>>> there implemented as:
>>>
>>> ```
>>> class PubsubLiteIO extends com.google.cloud.pubsublite.beam.PubsubLiteIO
>>> {}
>>> ```
>>> and add a dependency on our beam artifact.
>>>
>>> This enables beam users who want to just use the
>>> beam-sdks-java-io-google-cloud-platform artifact to do so, but they can
>>> also track the canonical version separately in our repo to get fixes and
>>> improvements at a faster rate. All static methods from the parent class
>>> would be available on the class in the beam repo.
>>>
>>> I'd be interested to hear anyone's thoughts and suggestions surrounding
>>> this.
>>>
>>> -Daniel
>>>
>>

-- 
Regards,
Tomo


Re: [VOTE] Release 2.30.0, release candidate #1

2021-06-17 Thread Andrew Pilloud
For the website I filed https://issues.apache.org/jira/browse/BEAM-12507
For the jars I filed https://issues.apache.org/jira/browse/BEAM-12508

Andrew

On Mon, Jun 14, 2021 at 5:16 PM Justin Mclean  wrote:

> Hi there,
>
> I'm not on your PMC but I took a look at your release and noticed
> something that you might want to look into and correct:
>
> There are a couple of items included in the release that are not mentioned
> in your LICENSE file:
> ./beam-2.30.0/website/www/site/static/js/bootstrap.js
> ./beam-2.30.0/website/www/site/static/js/hero/lottie-light.min.js
>
> ./beam-2.30.0/website/www/site/static/fonts/bootstrap/glyphicons-halflings-regular.*
>
> Was the website meant to be included in the release?
>
> The release contains compiled jar files:
>  + ./beam-2.30.0/learning/katas/java/gradle/wrapper/gradle-wrapper.jar
>  + ./beam-2.30.0/learning/katas/kotlin/gradle/wrapper/gradle-wrapper.jar
>
> Kind Regards,
> Justin
>
>


Aliasing Pub/Sub Lite IO in external repo

2021-06-17 Thread Daniel Collins
Hello beam developers,

I'm the primary author of the Pub/Sub Lite I/O, and I'd like to get some
feedback on a change to the model for hosting this I/O in beam. Our team
has been frustrated by the fact that we have no way to release features or
fixes for bugs to customers on time scales shorter than the 1-2 months of
the beam release cycle, and that those fixes are necessarily coupled with a
beam version upgrade. To work around this, I forked the I/O in beam to our
own repo about 6 months ago and have been maintaining both copies in
parallel.

I'd like to retain our ability to quickly fix and improve the I/O while
retaining end-user visibility within the beam repo. To do this, I'd like
to remove all the implementation from the beam repo, and leave the I/O
there implemented as:

```
class PubsubLiteIO extends com.google.cloud.pubsublite.beam.PubsubLiteIO {}
```

and add a dependency on our beam artifact.

This enables beam users who want to just use the
beam-sdks-java-io-google-cloud-platform artifact to do so, but they can
also track the canonical version separately in our repo to get fixes and
improvements at a faster rate. All static methods from the parent class
would be available on the class in the beam repo.
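The "empty subclass" alias works for the static factory methods because Java resolves static members through subclass names. A toy demonstration; ParentIO and AliasIO are stand-ins for the real PubsubLiteIO classes, not their actual API:

```java
// Toy stand-ins for com.google.cloud.pubsublite.beam.PubsubLiteIO and its
// beam-repo alias; they demonstrate only the language mechanism, not the
// real Pub/Sub Lite API.
class ParentIO {
    static String read() { return "reading"; }
}

// The empty alias subclass, as proposed for the beam repo.
class AliasIO extends ParentIO {}

public class AliasDemo {
    public static void main(String[] args) {
        // Static methods declared on the parent are callable via the alias.
        System.out.println(AliasIO.read()); // reading
    }
}
```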

I'd be interested to hear anyone's thoughts and suggestions surrounding this.

-Daniel


Flaky test issue report (29)

2021-06-17 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20labels%20%3D%20flake)

These are P1 issues because they have a major negative impact on the community 
and make it hard to determine the quality of the software.

https://issues.apache.org/jira/browse/BEAM-12490: Python PreCommit flaking 
in SdkWorkerTest.test_harness_monitoring_infos_and_metadata (created 2021-06-14)
https://issues.apache.org/jira/browse/BEAM-12322: 
FnApiRunnerTestWithGrpcAndMultiWorkers flaky (py precommit) (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12309: 
PubSubIntegrationTest.test_streaming_data_only flake (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12307: 
PubSubBigQueryIT.test_file_loads flake (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12303: Flake in 
PubSubIntegrationTest.test_streaming_with_attributes (created 2021-05-06)
https://issues.apache.org/jira/browse/BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (created 2021-05-05)
https://issues.apache.org/jira/browse/BEAM-12200: 
SamzaStoreStateInternalsTest is flaky (created 2021-04-20)
https://issues.apache.org/jira/browse/BEAM-12163: Python GHA PreCommits 
flake with grpc.FutureTimeoutError on SDK harness startup (created 2021-04-13)
https://issues.apache.org/jira/browse/BEAM-12061: beam_PostCommit_SQL 
failing on KafkaTableProviderIT.testFakeNested (created 2021-03-27)
https://issues.apache.org/jira/browse/BEAM-12019: 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky (created 2021-03-18)
https://issues.apache.org/jira/browse/BEAM-11792: Python precommit failed 
(flaked?) installing package  (created 2021-02-10)
https://issues.apache.org/jira/browse/BEAM-11666: 
apache_beam.runners.interactive.recording_manager_test.RecordingManagerTest.test_basic_execution
 is flaky (created 2021-01-20)
https://issues.apache.org/jira/browse/BEAM-11661: hdfsIntegrationTest 
flake: network not found (py38 postcommit) (created 2021-01-19)
https://issues.apache.org/jira/browse/BEAM-11645: beam_PostCommit_XVR_Flink 
failing (created 2021-01-15)
https://issues.apache.org/jira/browse/BEAM-11541: 
testTeardownCalledAfterExceptionInProcessElement flakes on direct runner. 
(created 2020-12-30)
https://issues.apache.org/jira/browse/BEAM-10968: flaky test: 
org.apache.beam.sdk.metrics.MetricsTest$AttemptedMetricTests.testAttemptedDistributionMetrics
 (created 2020-09-25)
https://issues.apache.org/jira/browse/BEAM-10955: Flink Java Runner test 
flake: Could not find Flink job  (created 2020-09-23)
https://issues.apache.org/jira/browse/BEAM-10866: 
PortableRunnerTestWithSubprocesses.test_register_finalizations flaky on macOS 
(created 2020-09-09)
https://issues.apache.org/jira/browse/BEAM-10485: Failure / flake: 
ElasticsearchIOTest > testWriteWithIndexFn (created 2020-07-14)
https://issues.apache.org/jira/browse/BEAM-9649: 
beam_python_mongoio_load_test started failing due to mismatched results 
(created 2020-03-31)
https://issues.apache.org/jira/browse/BEAM-9232: 
BigQueryWriteIntegrationTests is flaky coercing to Unicode (created 2020-01-31)
https://issues.apache.org/jira/browse/BEAM-8101: Flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful for 
Direct, Spark, Flink (created 2019-08-27)
https://issues.apache.org/jira/browse/BEAM-8035: 
WatchTest.testMultiplePollsWithManyResults flake: Outputs must be in timestamp 
order (sickbayed) (created 2019-08-22)
https://issues.apache.org/jira/browse/BEAM-7992: Unhandled type_constraint 
in 
apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
 (created 2019-08-16)
https://issues.apache.org/jira/browse/BEAM-7827: 
MetricsTest$AttemptedMetricTests.testAllAttemptedMetrics is flaky on 
DirectRunner (created 2019-07-26)
https://issues.apache.org/jira/browse/BEAM-7752: Java Validates 
DirectRunner: testTeardownCalledAfterExceptionInFinishBundleStateful flaky 
(created 2019-07-16)
https://issues.apache.org/jira/browse/BEAM-6804: [beam_PostCommit_Java] 
[PubsubReadIT.testReadPublicData] Timeout waiting on Sub (created 2019-03-11)
https://issues.apache.org/jira/browse/BEAM-5286: 
[beam_PostCommit_Java_GradleBuild][org.apache.beam.examples.subprocess.ExampleEchoPipelineTest.testExampleEchoPipeline][Flake]
 .sh script: text file busy. (created 2018-09-01)
https://issues.apache.org/jira/browse/BEAM-5172: 
org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky (created 
2018-08-20)


P1 issues report (41)

2021-06-17 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20priority%20%3D%20P1%20AND%20(labels%20is%20EMPTY%20OR%20labels%20!%3D%20flake).

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

https://issues.apache.org/jira/browse/BEAM-12500: Dataflow SocketException 
(SSLException) error while trying to send message from Cloud Pub/Sub to 
BigQuery (created 2021-06-16)
https://issues.apache.org/jira/browse/BEAM-12484: JdbcIO date conversion is 
sensitive to OS (created 2021-06-14)
https://issues.apache.org/jira/browse/BEAM-12467: 
java.io.InvalidClassException With Flink Kafka (created 2021-06-09)
https://issues.apache.org/jira/browse/BEAM-12436: 
[beam_PostCommit_Go_VR_flink| beam_PostCommit_Go_VR_spark] 
[:sdks:go:test:flinkValidatesRunner] Failure summary (created 2021-06-01)
https://issues.apache.org/jira/browse/BEAM-12422: Vendored gRPC 1.36.0 is 
using a log4j version with security issues (created 2021-05-28)
https://issues.apache.org/jira/browse/BEAM-12396: 
beam_PostCommit_XVR_Direct failed (flaked?) (created 2021-05-24)
https://issues.apache.org/jira/browse/BEAM-12389: 
beam_PostCommit_XVR_Dataflow flaky: Expand method not found (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12387: beam_PostCommit_Python* 
timing out (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12386: 
beam_PostCommit_Py_VR_Dataflow(_V2) failing metrics tests (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12380: Go SDK Kafka IO Transform 
implemented via XLang (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12374: Spark postcommit failing 
ResumeFromCheckpointStreamingTest (created 2021-05-20)
https://issues.apache.org/jira/browse/BEAM-12320: 
PubsubTableProviderIT.testSQLSelectsArrayAttributes[0] failing in SQL 
PostCommit (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12310: 
beam_PostCommit_Java_DataflowV2 failing (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12279: Implement 
destination-dependent sharding in FileIO.writeDynamic (created 2021-05-04)
https://issues.apache.org/jira/browse/BEAM-12256: 
PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode some 
Avro logical types (created 2021-04-29)
https://issues.apache.org/jira/browse/BEAM-12076: Update Python 
cross-language Kafka source to read metadata (created 2021-03-31)
https://issues.apache.org/jira/browse/BEAM-11959: Python Beam SDK Harness 
hangs when installing pip packages (created 2021-03-11)
https://issues.apache.org/jira/browse/BEAM-11906: No trigger early 
repeatedly for session windows (created 2021-03-01)
https://issues.apache.org/jira/browse/BEAM-11875: XmlIO.Read does not 
handle XML encoding per spec (created 2021-02-26)
https://issues.apache.org/jira/browse/BEAM-11828: JmsIO is not 
acknowledging messages correctly (created 2021-02-17)
https://issues.apache.org/jira/browse/BEAM-11755: Cross-language 
consistency (RequiresStableInputs) is quietly broken (at least on portable 
flink runner) (created 2021-02-05)
https://issues.apache.org/jira/browse/BEAM-11578: `dataflow_metrics` 
(python) fails with TypeError (when int overflowing?) (created 2021-01-06)
https://issues.apache.org/jira/browse/BEAM-11434: Expose Spanner 
admin/batch clients in Spanner Accessor (created 2020-12-10)
https://issues.apache.org/jira/browse/BEAM-11148: Kafka 
commitOffsetsInFinalize OOM on Flink (created 2020-10-28)
https://issues.apache.org/jira/browse/BEAM-11017: Timer with dataflow 
runner can be set multiple times (dataflow runner) (created 2020-10-05)
https://issues.apache.org/jira/browse/BEAM-10670: Make non-portable 
Splittable DoFn the only option when executing Java "Read" transforms (created 
2020-08-10)
https://issues.apache.org/jira/browse/BEAM-10617: python 
CombineGlobally().with_fanout() cause duplicate combine results for sliding 
windows (created 2020-07-31)
https://issues.apache.org/jira/browse/BEAM-10569: SpannerIO tests don't 
actually assert anything. (created 2020-07-23)
https://issues.apache.org/jira/browse/BEAM-10529: Kafka XLang fails for 
"empty" key/values (created 2020-07-18)
https://issues.apache.org/jira/browse/BEAM-10288: Quickstart documents are 
out of date (created 2020-06-19)
https://issues.apache.org/jira/browse/BEAM-10244: Populate requirements 
cache fails on poetry-based packages (created 2020-06-11)
https://issues.apache.org/jira/browse/BEAM-10100: FileIO writeDynamic with 
AvroIO.sink not writing all data (created 2020-05-27)
https://issues.apache.org/jira/browse/BEAM-9564: Remove insecure ssl 
options from MongoDBIO (created 2020-03-20)
https://issues.apache.org/jira/browse/BEAM-9455: Envir

Re: FileIO with custom sharding function

2021-06-17 Thread Robert Bradshaw
Sharding is typically a physical rather than logical property of the
pipeline, and I'm not convinced it makes sense to add it to Beam in
general. One can already use keys and GBK/Stateful DoFns if some kind
of logical grouping is needed, and adding constraints like this can
prevent opportunities for optimizations (like dynamic sharding and
fusion).

That being said, file outputs are one area where it could make sense. I
would expect that dynamic destinations could cover this use case, and a
general FileNaming subclass could be provided to make this pattern
easier (and possibly some syntactic sugar for auto-setting num shards
to 0). (One downside of this approach is that one couldn't do dynamic
destinations, and have each sharded with a distinct sharding function
as well.)

If this doesn't work, we could look into adding ShardingFunction as a
publicly exposed parameter to FileIO. (I'm actually surprised it
already exists.)
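The "sugar" described above would essentially bake a shard index into the generated file name per destination. A rough sketch of such a naming helper; this is plain Java for illustration, not Beam's actual FileIO.Write.FileNaming interface, which also takes window, pane, and compression arguments:

```java
// Rough sketch of a naming helper that encodes a shard index into the file
// name for a given dynamic destination. Deliberately simplified to show
// only the shard-encoding idea behind the proposed FileNaming subclass.
public class ShardedNaming {
    static String fileName(String destination, int shard, int numShards) {
        return String.format("%s/part-%05d-of-%05d", destination, shard, numShards);
    }

    public static void main(String[] args) {
        System.out.println(fileName("events/2021-06-17", 3, 10));
        // events/2021-06-17/part-00003-of-00010
    }
}
```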

On Thu, Jun 17, 2021 at 9:39 AM  wrote:
>
> Alright, but what is worth emphasizing is that we talk about batch workloads. 
> The typical scenario is that the output is committed once the job finishes 
> (e.g., by atomic rename of directory).
>  Jan
>
> On 17. 6. 2021 at 17:59, Reuven Lax wrote:
>
> Yes - the problem is that Beam makes no guarantees of determinism anywhere in 
> the system. User DoFns might be non deterministic, and have no way to know 
> (we've discussed providing users with an @IsDeterministic annotation, however 
> empirically users often think their functions are deterministic when they are 
> in fact not). _Any_ sort of triggered aggregation (including watermark based) 
> can always be non deterministic.
>
> Even if everything was deterministic, replaying everything is tricky. The 
> output files already exist - should the system delete them and recreate them, 
> or leave them as is and delete the temp files? Either decision could be 
> problematic.
>
> On Wed, Jun 16, 2021 at 11:40 PM Jan Lukavský  wrote:
>
> Correct, by the external shuffle service I pretty much meant "offloading the 
> contents of a shuffle phase out of the system". Looks like that is what the 
> Spark's checkpoint does as well. On the other hand (if I understand the 
> concept correctly), that implies some performance penalty - the data has to 
> be moved to external distributed filesystem. Which then feels weird if we 
> optimize code to avoid computing random numbers, but are okay with moving 
> complete datasets back and forth. I think in this particular case making the 
> Pipeline deterministic - idempotent to be precise - (within the limits, yes, 
> hashCode of enum is not stable between JVMs) would seem more practical to me.
>
>  Jan
>
> On 6/17/21 7:09 AM, Reuven Lax wrote:
>
> I have some thoughts here, as Eugene Kirpichov and I spent a lot of time 
> working through these semantics in the past.
>
> First - about the problem of duplicates:
>
> A "deterministic" sharding - e.g. hashCode based (though Java makes no 
> guarantee that hashCode is stable across JVM instances, so this technique 
> ends up not being stable) doesn't really help matters in Beam. Unlike other 
> systems, Beam makes no assumptions that transforms are idempotent or 
> deterministic. What's more, _any_ transform that has any sort of triggered 
> grouping (whether the trigger used is watermark based or otherwise) is non 
> deterministic.
>
> Forcing a hash of every element imposed quite a CPU cost; even generating a 
> random number per-element slowed things down too much, which is why the 
> current code generates a random number only in startBundle.
>
> Any runner that does not implement RequiresStableInput will not properly 
> execute FileIO. Dataflow and Flink both support this. I believe that the 
> Spark runner implicitly supports it by manually calling checkpoint as Ken 
> mentioned (unless someone removed that from the Spark runner, but if so that 
> would be a correctness regression). Implementing this has nothing to do with 
> external shuffle services - neither Flink, Spark, nor Dataflow appliance 
> (classic shuffle) have any problem correctly implementing RequiresStableInput.
>
> On Wed, Jun 16, 2021 at 11:18 AM Jan Lukavský  wrote:
>
> I think that the support for @RequiresStableInput is rather limited. AFAIK it 
> is supported by streaming Flink (where it is not needed in this situation) 
> and by Dataflow. Batch runners without external shuffle service that works as 
> in the case of Dataflow have IMO no way to implement it correctly. In the 
> case of FileIO (which does not use @RequiresStableInput, as it would not be 
> supported on Spark) the randomness is easily avoidable (hashCode of the 
> key?) and it seems to me that should be preferred.
>
>  Jan
>
> On 6/16/21 6:23 PM, Kenneth Knowles wrote:
>
>
> On Wed, Jun 16, 2021 at 5:18 AM Jan Lukavský  wrote:
>
> Hi,
>
> maybe a little unrelated, but I think we definitely should not use random 
> assignment of shard keys (RandomShardingFunction), at least fo
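The per-bundle randomization Reuven describes above ("a random number only in startBundle") can be sketched as follows; this is a simplified illustration of the idea, not Beam's actual WriteFiles implementation:

```java
import java.util.Random;

// Simplified sketch of per-bundle shard assignment: one random draw when a
// bundle starts, then a round-robin walk over the shards for the elements
// of that bundle. This captures the cost argument (no hash or random draw
// per element) but is not the real RandomShardingFunction code.
public class BundleShardAssigner {
    private final int numShards;
    private int next;

    BundleShardAssigner(int numShards) { this.numShards = numShards; }

    // Called once per bundle: the only random draw.
    void startBundle() { next = new Random().nextInt(numShards); }

    // Called per element: a cheap increment-and-modulo.
    int assignShard() { return next++ % numShards; }

    public static void main(String[] args) {
        BundleShardAssigner assigner = new BundleShardAssigner(5);
        assigner.startBundle();
        for (int i = 0; i < 7; i++) {
            System.out.println(assigner.assignShard()); // always in [0, 5)
        }
    }
}
```

The trade-off discussed in the thread: because the starting shard is random per bundle, re-executing a bundle can assign different shards, which is why the transform needs stable input on retries.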

Re: Java precommit failing (though no tests are failing)

2021-06-17 Thread Alex Amato
Hmm, perhaps it only happens sometimes. The other half of the time I run
"Run Java Precommit" on this PR, I hit this different failure:

If it's related to my PR, the connection is not obvious to me.
https://github.com/apache/beam/pull/14804
I only added some Precondition checks, but I don't see those failing
anywhere.
(Unless something indirect is causing it and the stacktrace for that is not
printed, e.g. in a subprocess.)

Any ideas? Are these tests known to be failing right now?

https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3742/#showFailuresLink

 Test Result (32 failures / +32)
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteScriptedUpsert
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testReadWithMetadata
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithIndexFn
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testMaxParallelRequestsPerWindow
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteRetryValidRequest
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithMaxBatchSize
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteRetry
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testReadWithQueryString
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testSizes
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithMaxBatchSizeBytes
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithDocVersion
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithAllowableErrors
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithTypeFn
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteScriptedUpsert
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testReadWithQueryValueProvider
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testSplit
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteRetryValidRequest
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithDocVersion
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testSizes
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testMaxParallelRequestsPerWindow
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testReadWithQueryString
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWritePartialUpdate
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithMaxBatchSizeBytes
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testDefaultRetryPredicate
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithIndexFn
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithRouting
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteRetry
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testReadWithMetadata
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteFullAddressing
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithMaxBatchSize
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWriteWithIsDeleteFn
org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testWrite

On Wed, Jun 16, 2021 at 5:24 PM Robert Burke  wrote:

> Very odd, as those paths do resolve now, redirecting to their pkg.go.dev
> paths. This feels transient, but it's not clear why that would
> return a 404 vs. some other error.
>
> On Wed, 16 Jun 2021 at 15:39, Kyle Weaver  wrote:
>
>> For tasks without structured JUnit output, we have to scroll up / ctrl-f / 
>> grep for more logs. In this case it looks like it was probably a server-side 
>> issue. These links work for me, so I'm assuming the problem has been 
>> resolved.
>>
>>
>> *11:31:04* >* Task :release:go-licenses:java:dockerRun**11:31:04* package 
>> google.golang.org/protobuf/reflect/protoreflect: unrecognized import path 
>> "google.golang.org/protobuf/reflect/protoreflect": reading 
>> https://google.golang.org/protobuf/reflect/protoreflect?go-get=1: 404 Not 
>> Found*11:31:04* package google.golang.org/protobuf/runtime/protoimpl: 
>> unrecognized import path "google.golang.org/protobuf/runtime/protoimpl": 
>> reading https://google.golang.org/protobuf/runtime/protoimpl?go-get=1: 404 
>> Not Found*11:31:04* package google.golang.org/protobuf/types/descriptorpb: 
>> unrecognized import path "google.golang.org/protobuf/types/descriptorpb": 
>> reading https://google.golang.org/protobuf/types/descriptorpb?go-get=1: 404 
>> Not Found*11:31:04* package 
>> google.golang.org/protobuf/types/known/durationpb: unrecognized import path 
>> "google.golang.org/protobuf/types/known/durationpb": reading 
>> https://google.golang.org/protobuf/types/known/durationpb?go-get=1: 404 Not 
>> Found
>>
>>
>>
>> On Wed, Jun 16, 2021 at 2:35 PM Alex Amato  wrote:
>>
>>> For PR: https://github.com/apache/beam/pull/14804
>>>
>>> Is something wrong with this machine, preventing it from running docker?
>>> It seems to happen again after a few re-runs as well.
>>>
>>

Re: FileIO with custom sharding function

2021-06-17 Thread je . ik
Alright, but what is worth emphasizing is that we talk about batch
workloads. The typical scenario is that the output is committed once the
job finishes (e.g., by atomic rename of directory).

 Jan

On 17. 6. 2021 17:59, Reuven Lax wrote:

> Yes - the problem is that Beam makes no guarantees of determinism anywhere
> in the system. User DoFns might be non deterministic, and have no way to
> know (we've discussed providing users with an @IsDeterministic annotation,
> however empirically users often think their functions are deterministic
> when they are in fact not). _Any_ sort of triggered aggregation (including
> watermark based) can always be non deterministic.
>
> Even if everything was deterministic, replaying everything is tricky. The
> output files already exist - should the system delete them and recreate
> them, or leave them as is and delete the temp files? Either decision could
> be problematic.
>
> On Wed, Jun 16, 2021 at 11:40 PM Jan Lukavský  wrote:
>
>> Correct, by the external shuffle service I pretty much meant "offloading
>> the contents of a shuffle phase out of the system". Looks like that is
>> what the Spark's checkpoint does as well. On the other hand (if I
>> understand the concept correctly), that implies some performance penalty
>> - the data has to be moved to an external distributed filesystem. Which
>> then feels weird if we optimize code to avoid computing random numbers,
>> but are okay with moving complete datasets back and forth. I think in
>> this particular case making the Pipeline deterministic - idempotent to be
>> precise - (within the limits, yes, hashCode of enum is not stable between
>> JVMs) would seem more practical to me.
>>
>>  Jan
>>
>> On 6/17/21 7:09 AM, Reuven Lax wrote:
>>
>>> I have some thoughts here, as Eugene Kirpichov and I spent a lot of time
>>> working through these semantics in the past.
>>>
>>> First - about the problem of duplicates:
>>>
>>> A "deterministic" sharding - e.g. hashCode based (though Java makes no
>>> guarantee that hashCode is stable across JVM instances, so this
>>> technique ends up not being stable) doesn't really help matters in Beam.
>>> Unlike other systems, Beam makes no assumptions that transforms are
>>> idempotent or deterministic. What's more, _any_ transform that has any
>>> sort of triggered grouping (whether the trigger used is watermark based
>>> or otherwise) is non deterministic.
>>>
>>> Forcing a hash of every element imposed quite a CPU cost; even
>>> generating a random number per-element slowed things down too much,
>>> which is why the current code generates a random number only in
>>> startBundle.
>>>
>>> Any runner that does not implement RequiresStableInput will not properly
>>> execute FileIO. Dataflow and Flink both support this. I believe that the
>>> Spark runner implicitly supports it by manually calling checkpoint as
>>> Ken mentioned (unless someone removed that from the Spark runner, but if
>>> so that would be a correctness regression). Implementing this has
>>> nothing to do with external shuffle services - neither Flink, Spark, nor
>>> Dataflow appliance (classic shuffle) have any problem correctly
>>> implementing RequiresStableInput.
>>>
>>> On Wed, Jun 16, 2021 at 11:18 AM Jan Lukavský  wrote:
>>>
>>>> I think that the support for @RequiresStableInput is rather limited.
>>>> AFAIK it is supported by streaming Flink (where it is not needed in
>>>> this situation) and by Dataflow. Batch runners without an external
>>>> shuffle service that works as in the case of Dataflow have IMO no way
>>>> to implement it correctly. In the case of FileIO (which does not use
>>>> @RequiresStableInput as it would not be supported on Spark) the
>>>> randomness is easily avoidable (hashCode of key?) and it seems to me
>>>> it should be preferred.
>>>>
>>>>  Jan
>>>>
>>>> On 6/16/21 6:23 PM, Kenneth Knowles wrote:
>>>>
>>>>> On Wed, Jun 16, 2021 at 5:18 AM Jan Lukavský  wrote:
>>>>>
>>>>>> Hi,
>>>>>> maybe a little unrelated, but I think we definitely should not use
>>>>>> random assignment of shard keys (RandomShardingFunction), at least
>>>>>> for bounded workloads (seems to be fine for streaming workloads).
>>>>>> Many batch runners simply recompute path in the com
Re: [PROPOSAL] Stable URL for "current" API Documentation

2021-06-17 Thread Cristian Constantinescu
Big +1 here. In the past few days I've replaced the 2.*.0 part of the
Google-found javadoc URL with 2.29.0 more times than I could count. I
should have made a pipeline with a session window to count those
replacements though :P

On Thu, Jun 17, 2021 at 12:18 PM Robert Bradshaw 
wrote:

> This makes a lot of sense to me.
>
> On Thu, Jun 17, 2021 at 9:03 AM Brian Hulette  wrote:
> >
> > Hi everyone,
> > You may have noticed that our API Documentation could really use some
> SEO. It's possible to search for Beam APIs (e.g. "beam dataframe read_csv"
> [1] or "beam java ParquetIO" [2]) and you will be directed to some
> documentation, but it almost always takes you to an old version. I think
> this could be significantly improved if we just make one change: rather
> than making https://beam.apache.org/releases/javadoc/current redirect to
> the latest release, we should just always stage the latest documentation
> there.
> >
> > To be clear I'm not 100% sure this will help. I haven't talked to any
> search engineers or SEO experts about it. I'm just looking at other
> projects as a point of reference. I've found that I never have trouble
> finding the latest pandas documentation (e.g. "pandas read_csv" [3]) since
> it always directs to "pandas-docs/stable/" rather than a particular version
> number.
> >
> > We should also make sure the version number shows up in the page title,
> it looks like this isn't the case for Python right now.
> >
> > Would there be any objections to making this change?
> >
> > Also are there thoughts on how to make the change? Presumably this is
> something we'd have to update in the release process.
> >
> > Thanks,
> > Brian
> >
> > [1]
> https://beam.apache.org/releases/pydoc/2.25.0/apache_beam.dataframe.io.html
> > [2]
> https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/parquet/ParquetIO.html
> > [3]
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
>


Re: [PROPOSAL] Stable URL for "current" API Documentation

2021-06-17 Thread Robert Bradshaw
This makes a lot of sense to me.

On Thu, Jun 17, 2021 at 9:03 AM Brian Hulette  wrote:
>
> Hi everyone,
> You may have noticed that our API Documentation could really use some SEO. 
> It's possible to search for Beam APIs (e.g. "beam dataframe read_csv" [1] or 
> "beam java ParquetIO" [2]) and you will be directed to some documentation, 
> but it almost always takes you to an old version. I think this could be 
> significantly improved if we just make one change: rather than making 
> https://beam.apache.org/releases/javadoc/current redirect to the latest 
> release, we should just always stage the latest documentation there.
>
> To be clear I'm not 100% sure this will help. I haven't talked to any search 
> engineers or SEO experts about it. I'm just looking at other projects as a 
> point of reference. I've found that I never have trouble finding the latest 
> pandas documentation (e.g. "pandas read_csv" [3]) since it always directs to 
> "pandas-docs/stable/" rather than a particular version number.
>
> We should also make sure the version number shows up in the page title, it 
> looks like this isn't the case for Python right now.
>
> Would there be any objections to making this change?
>
> Also are there thoughts on how to make the change? Presumably this is 
> something we'd have to update in the release process.
>
> Thanks,
> Brian
>
> [1] 
> https://beam.apache.org/releases/pydoc/2.25.0/apache_beam.dataframe.io.html
> [2] 
> https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/parquet/ParquetIO.html
> [3] 
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html


[PROPOSAL] Stable URL for "current" API Documentation

2021-06-17 Thread Brian Hulette
Hi everyone,
You may have noticed that our API Documentation could really use some SEO.
It's possible to search for Beam APIs (e.g. "beam dataframe read_csv" [1]
or "beam java ParquetIO" [2]) and you will be directed to some
documentation, but it almost always takes you to an old version. I think
this could be significantly improved if we just make one change: rather
than making https://beam.apache.org/releases/javadoc/current redirect to
the latest release, we should just always stage the latest documentation
there.

To be clear I'm not 100% sure this will help. I haven't talked to any
search engineers or SEO experts about it. I'm just looking at other
projects as a point of reference. I've found that I never have trouble
finding the latest pandas documentation (e.g. "pandas read_csv" [3]) since
it always directs to "pandas-docs/stable/" rather than a particular version
number.

We should also make sure the version number shows up in the page title, it
looks like this isn't the case for Python right now.

Would there be any objections to making this change?

Also are there thoughts on how to make the change? Presumably this is
something we'd have to update in the release process.

Thanks,
Brian

[1]
https://beam.apache.org/releases/pydoc/2.25.0/apache_beam.dataframe.io.html
[2]
https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/parquet/ParquetIO.html
[3]
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html


Re: FileIO with custom sharding function

2021-06-17 Thread Reuven Lax
Yes - the problem is that Beam makes no guarantees of determinism anywhere
in the system. User DoFns might be non deterministic, and have no way to
know (we've discussed providing users with an @IsDeterministic annotation,
however empirically users often think their functions are deterministic
when they are in fact not). _Any_ sort of triggered aggregation (including
watermark based) can always be non deterministic.

Even if everything was deterministic, replaying everything is tricky. The
output files already exist - should the system delete them and recreate
them, or leave them as is and delete the temp files? Either decision could
be problematic.
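
The two sharding strategies being contrasted in this thread can be sketched
in plain Java. This is an illustrative sketch only (not Beam's actual
WriteFiles or ShardingFunction code; the class and method names here are
invented for the example):

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of the two shard-assignment strategies discussed
// in this thread. Not Beam's actual implementation.
public class ShardingSketch {

    // "Deterministic" sharding: derive the shard from the key's hashCode.
    // Note: String.hashCode() happens to be specified by the JLS and is
    // therefore stable, but for enums and for classes that don't override
    // hashCode() the value is identity-based and differs between JVM
    // instances - which is Reuven's caveat about this technique.
    static int hashShard(Object key, int numShards) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    // Random sharding in the spirit of what the thread describes: draw a
    // random number once per bundle (not per element) to avoid per-element
    // RNG cost, then step through shards from that starting point. This is
    // a plausible variant, not the verbatim WriteFiles code.
    static class RandomBundleSharder {
        private final int numShards;
        private int next;

        RandomBundleSharder(int numShards) {
            this.numShards = numShards;
        }

        void startBundle() {
            // The only random draw - once per bundle.
            next = ThreadLocalRandom.current().nextInt(numShards);
        }

        int nextShard() {
            int shard = next;
            next = (next + 1) % numShards;
            return shard;
        }
    }

    public static void main(String[] args) {
        // The hash-based assignment is repeatable across retries; the
        // random one is not, which is why a retry after shuffle-data loss
        // can produce different shard contents (and hence duplicates).
        System.out.println(hashShard("user-42", 10));
        RandomBundleSharder s = new RandomBundleSharder(10);
        s.startBundle();
        System.out.println(s.nextShard());
    }
}
```

The sketch makes the trade-off concrete: `hashShard` is idempotent on retry
but pays a hash per element (and is only stable for types with a specified
hashCode), while the per-bundle random draw is cheap but non-deterministic.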

On Wed, Jun 16, 2021 at 11:40 PM Jan Lukavský  wrote:

> Correct, by the external shuffle service I pretty much meant "offloading
> the contents of a shuffle phase out of the system". Looks like that is what
> the Spark's checkpoint does as well. On the other hand (if I understand the
> concept correctly), that implies some performance penalty - the data has to
> be moved to an external distributed filesystem. Which then feels weird if we
> optimize code to avoid computing random numbers, but are okay with moving
> complete datasets back and forth. I think in this particular case making
> the Pipeline deterministic - idempotent to be precise - (within the limits,
> yes, hashCode of enum is not stable between JVMs) would seem more practical
> to me.
>
>  Jan
> On 6/17/21 7:09 AM, Reuven Lax wrote:
>
> I have some thoughts here, as Eugene Kirpichov and I spent a lot of time
> working through these semantics in the past.
>
> First - about the problem of duplicates:
>
> A "deterministic" sharding - e.g. hashCode based (though Java makes no
> guarantee that hashCode is stable across JVM instances, so this technique
> ends up not being stable) doesn't really help matters in Beam. Unlike other
> systems, Beam makes no assumptions that transforms are idempotent or
> deterministic. What's more, _any_ transform that has any sort of triggered
> grouping (whether the trigger used is watermark based or otherwise) is non
> deterministic.
>
> Forcing a hash of every element imposed quite a CPU cost; even generating
> a random number per-element slowed things down too much, which is why the
> current code generates a random number only in startBundle.
>
> Any runner that does not implement RequiresStableInput will not properly
> execute FileIO. Dataflow and Flink both support this. I believe that the
> Spark runner implicitly supports it by manually calling checkpoint as Ken
> mentioned (unless someone removed that from the Spark runner, but if so
> that would be a correctness regression). Implementing this has nothing to
> do with external shuffle services - neither Flink, Spark, nor Dataflow
> appliance (classic shuffle) have any problem correctly implementing
> RequiresStableInput.
>
> On Wed, Jun 16, 2021 at 11:18 AM Jan Lukavský  wrote:
>
>> I think that the support for @RequiresStableInput is rather limited.
>> AFAIK it is supported by streaming Flink (where it is not needed in this
>> situation) and by Dataflow. Batch runners without external shuffle service
>> that works as in the case of Dataflow have IMO no way to implement it
>> correctly. In the case of FileIO (which does not use @RequiresStableInput as
>> it would not be supported on Spark) the randomness is easily avoidable
>> (hashCode of key?) and it seems to me it should be preferred.
>>
>>  Jan
>> On 6/16/21 6:23 PM, Kenneth Knowles wrote:
>>
>>
>> On Wed, Jun 16, 2021 at 5:18 AM Jan Lukavský  wrote:
>>
>>> Hi,
>>>
>>> maybe a little unrelated, but I think we definitely should not use
>>> random assignment of shard keys (RandomShardingFunction), at least for
>>> bounded workloads (seems to be fine for streaming workloads). Many batch
>>> runners simply recompute path in the computation DAG from the failed node
>>> (transform) to the root (source). In the case there is any non-determinism
>>> involved in the logic, then it can result in duplicates (as the 'previous'
>>> attempt might have ended in DAG path that was not affected by the fail).
>>> That addresses option 2) of what Jozef has mentioned.
>>>
>> This is the reason we introduced "@RequiresStableInput".
>>
>> When things were only Dataflow, we knew that each shuffle was a
>> checkpoint, so we inserted a Reshuffle after the random numbers were
>> generated, freezing them so it was safe for replay. Since other engines do
>> not checkpoint at every shuffle, we needed a way to communicate that this
>> checkpointing was required for correctness. I think we still have many IOs
>> that are written using Reshuffle instead of @RequiresStableInput, and I
>> don't know which runners process @RequiresStableInput properly.
>>
>> By the way, I believe the SparkRunner explicitly calls materialize()
>> after a GBK specifically so that it gets correct results for IOs that rely
>> on Reshuffle. Has that changed?
>>
>> I agree that we should minimize use of RequiresStab

Re: [EXTERNAL] [EXTERNAL]

2021-06-17 Thread Alexey Romanenko


> On 15 Jun 2021, at 22:59, Raphael Sanamyan  
> wrote:
> 
> Hello,
> 
>> Is it somehow related to this work [1]? 
> 
> 
> No, this work adds the ability to return values from a SQL insert query. 
> There are no improvements for working with Row and Schema in it.
> 
>> Not sure that I got it. Could you elaborate a bit on this? 
> 
> 
> When we use "Write" with a table and without a statement, "Write.expand" is 
> called, which automatically generates the statement and provides input to 
> "WriteVoid.expand", but when we use "Write.withResults", only 
> "WriteVoid.expand" is called, which can't automatically generate the 
> statement. If we add conditions there similar to those in "Write.expand" and 
> move the statement generation into "WriteVoid.expand", we'll fix this case.

Yes, I think we can do it there in the same way as we do for "Write.expand()". 
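
The automatic statement generation being discussed can be sketched roughly as
follows. This is a hypothetical helper for illustration only (not JdbcIO's
actual code; the class and method names are invented): it derives an INSERT
statement from the target table and the schema's field names, which is the
work "Write.expand" does for the user today.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of "automatic statement generation": build the INSERT
// statement from the table name and the schema's field names, instead of
// requiring the user to supply it. Not JdbcIO's actual implementation.
public class StatementGenerator {

    static String generateInsert(String table, List<String> fieldNames) {
        // Column list in schema order, e.g. "id, name, email".
        String columns = String.join(", ", fieldNames);
        // One JDBC placeholder per column, e.g. "?, ?, ?".
        String placeholders = fieldNames.stream()
                .map(f -> "?")
                .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + placeholders + ")";
    }

    public static void main(String[] args) {
        System.out.println(generateInsert("users", List.of("id", "name", "email")));
        // INSERT INTO users (id, name, email) VALUES (?, ?, ?)
    }
}
```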

> I analyzed the Write class again and it seems to be the only case where there 
> is no full support for automatic work with "row". I think it makes sense to 
> delete the TODO and close the task, to not confuse people. And create a task 
> to solve this case. What do you think about that?

I agree on this. 

Back to https://github.com/apache/beam/pull/14856/
It should essentially replace WriteVoid, since it does the same job but also 
returns the results of the write, and I suggested deprecating WriteVoid. So, we 
will need to add automatic statement generation there too.

—
Alexey

Re: [Proposal] Go SDK Exits Experimental

2021-06-17 Thread Robert Burke
Yup!

My immediate plan is to work on incorporating the Go SDK fully into the
Beam Programming Guide. I've audited the guide, and
am beginning to add missing content and fill in the Go-specific gaps.
This will be tied to improving the Go Doc with more Go-specific
user documentation that isn't appropriate for the BPG.

My audit of the guide is here:
https://docs.google.com/spreadsheets/d/1DrBFjxPBmMMmPfeFr6jr_JndxGOes8qDqKZ2Uxwvvds/edit?resourcekey=0-tVFwcLrQ2v2jpZkHk6QOpQ#gid=2072310090

The other sheets focus on features and tests. The feature page looks worse
than it is, as it was more productive to focus on what isn't available than
what is. That's a snapshot of my actual working sheet but I'll be updating
it as needed.

On Thu, Jun 17, 2021, 6:23 AM Ismaël Mejía  wrote:

> Oops, forgot to write one question. Will this come with revamped
> website instructions/doc for golang too?
>
> On Thu, Jun 17, 2021 at 3:21 PM Ismaël Mejía  wrote:
> >
> > Huge +1
> >
> > This is definitely something many people have asked about, so it is
> > great to see it finally happening.
> >
> > On Wed, Jun 16, 2021 at 7:56 PM Kenneth Knowles  wrote:
> > >
> > > +1 awesome
> > >
> > > On Wed, Jun 16, 2021 at 10:33 AM Robert Burke 
> wrote:
> > >>
> > >> Sounds reasonable to me. I agree. We'll aim to get those (Go modules
> and LICENSE issue) done before the 2.32 cut, and certainly before the 2.33
> cut if release images aren't added to the 2.32 process.
> > >>
> > >> Regarding Go Generics: at some point in the future, we may want a
> harder break between a newer Generic-first API and the current version,
> but there's no rush. Generics/TypeParameters in Go aren't identical to the
> feature referred to by that term in Java, C++, Rust, etc, so it'll take a
> bit of time for that expertise to develop.
> > >>
> > >> However, by the current nature of Go, we had to have pretty
> sophisticated reflective analysis to handle DoFns and map them to their
> graph inputs. So, adding new helpers like a KV, emitter, and Iterator
> types, shouldn't be too difficult. Changing Go SDK internals to use
> generics (like the implementation of Stats DoFns like Min, Max, etc) would
> also be able to be made transparently to most users, and certainly any of
> the framework for execution time handling (the "worker's SDK harness")
> would be able to be cleaned up if need be. Finally, adding more
> sophisticated DoFn registration and code generation would be able to
> replace the optional code generator entirely, saving some users a `go
> generate` step, simplifying getting improved execution performance.
> > >>
> > >> Changing things like making a Type Parameterized PCollection, would
> be far more involved, as would trying to use some kind of Apply format. The
> lack of Method Overrides prevents the apply chaining approach. Or at least
> prevents it from working simply.
> > >>
> > >> Finally, Go Generics won't be available until Go 1.18, which isn't
> until next year. See https://blog.golang.org/generics-proposal for
> details.
> > >>
> > >> Go 1.17 https://tip.golang.org/doc/go1.17 does include a Register
> calling convention, leading to a modest performance improvement across the
> board.
> > >>
> > >> Cheers,
> > >> Robert Burke
> > >>
> > >> On 2021/06/15 18:10:46, Robert Bradshaw  wrote:
> > >> > +1 to declaring Golang support out of experimental once the Go
> Modules
> > >> > issues are solved. I don't think an SDK needs to support every
> feature
> > >> > to be accepted, especially now that we can do cross-language
> > >> > transforms, and Go definitely supports enough to be quite useful.
> (WRT
> > >> > streaming, my understanding is that Go supports the streaming model
> > >> > with windows and timestamps, and runs fine on a streaming runner,
> even
> > >> > if more advanced features like state and timers aren't yet
> available.)
> > >> >
> > >> > This is a great milestone.
> > >> >
> > >> > On Tue, Jun 15, 2021 at 10:12 AM Tyson Hamilton 
> wrote:
> > >> > >
> > >> > > WOW! Big news.
> > >> > >
> > >> > > I'm supportive of leaving experimental status after Go Modules
> are completed and the LICENSE issue is resolved. I don't think that lacking
> streaming support is a blocker. The other thing I checked to see was if
> there were metrics available on metrics.beam.apache.org, specifically for
> measuring code health via post-commit over time, which there are and the
> passing test rate is high (Huzzah!). The one thing that surprised me from
> your summary is that when Go introduces generics it won't result in any
> backwards incompatible changes in Apache Beam. That's great news, but does
> it mean there will be a need to support both non-generic and generic APIs
> moving forward? It seems like generics will be introduced in the Go 1.17
> release (optimistically) in August this year.
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Thu, Jun 10, 2021 at 5:04 PM Robert Burke 
> wrote:
> > >> > >>
> > >> > >> Hello Beam Community!
> > >> > >>

CometD IO Connector for Beam

2021-06-17 Thread ZAFFALON Mattia - NTTDATA
Hi,

I'm reaching out to ask you a couple of questions regarding the existence, and 
possibly the state, of a connector for CometD (https://docs.cometd.org/), 
especially when used as a publish/subscribe mechanism.

You may have already heard about CometD, but, just to provide some context to 
this email, it is the way Salesforce delivers change data capture events to 
other systems. Here is the documentation from SF developer portal 
https://developer.salesforce.com/docs/atlas.en-us.218.0.change_data_capture.meta/change_data_capture/cdc_intro.htm,
 and this in particular is the section of the CometD documentation that describes 
the architecture of the publish/subscribe interaction that can be used with 
CometD: 
https://docs.cometd.org/current7/reference/#_concepts_channels_broadcast.

There is a pretty straightforward example of a client (from Salesforce, 
maintained by a community) in this repo: 
https://github.com/forcedotcom/EMP-Connector.

Given that the job I'm going to build seems to fit with the Beam/Dataflow 
streaming job model, I'd like to ask:

  *   It looks like there is nothing around the web and GitHub regarding Beam 
CometD connectors; do you know if there is anything in progress?
  *   Since I'm planning to build (at least the Source part of) the connector, 
could you please give me some advice about using SplittableDoFns (as is kind of 
suggested in the Beam documentation), or an UnboundedSource? It seems to me 
that for each subscription to one CometD channel there can be at most one 
consumer, hence I don't know if a splittable DoFn can be used here. On the 
other hand, an UnboundedSource with just one split seems to me to fit much 
better.

Would you please share your ideas?

Best regards,
Mattia


Re: [Proposal] Go SDK Exits Experimental

2021-06-17 Thread Ismaël Mejía
Oops, forgot to write one question. Will this come with revamped
website instructions/doc for golang too?

On Thu, Jun 17, 2021 at 3:21 PM Ismaël Mejía  wrote:
>
> Huge +1
>
> This is definitely something many people have asked about, so it is
> great to see it finally happening.
>
> On Wed, Jun 16, 2021 at 7:56 PM Kenneth Knowles  wrote:
> >
> > +1 awesome
> >
> > On Wed, Jun 16, 2021 at 10:33 AM Robert Burke  wrote:
> >>
> >> Sounds reasonable to me. I agree. We'll aim to get those (Go modules and 
> >> LICENSE issue) done before the 2.32 cut, and certainly before the 2.33 cut 
> >> if release images aren't added to the 2.32 process.
> >>
> >> Regarding Go Generics: at some point in the future, we may want a harder 
> >> break between a newer Generic-first API and the current version, but 
> >> there's no rush. Generics/TypeParameters in Go aren't identical to the 
> >> feature referred to by that term in Java, C++, Rust, etc, so it'll take a 
> >> bit of time for that expertise to develop.
> >>
> >> However, by the current nature of Go, we had to have pretty sophisticated 
> >> reflective analysis to handle DoFns and map them to their graph inputs. 
> >> So, adding new helpers like a KV, emitter, and Iterator types, shouldn't 
> >> be too difficult. Changing Go SDK internals to use generics (like the 
> >> implementation of Stats DoFns like Min, Max, etc) would also be able to be 
> >> made transparently to most users, and certainly any of the framework for 
> >> execution time handling (the "worker's SDK harness") would be able to be 
> >> cleaned up if need be. Finally, adding more sophisticated DoFn 
> >> registration and code generation would be able to replace the optional 
> >> code generator entirely, saving some users a `go generate` step, 
> >> simplifying getting improved execution performance.
> >>
> >> Changing things like making a Type Parameterized PCollection, would be far 
> >> more involved, as would trying to use some kind of Apply format. The lack 
> >> of Method Overrides prevents the apply chaining approach. Or at least 
> >> prevents it from working simply.
> >>
> >> Finally, Go Generics won't be available until Go 1.18, which isn't until 
> >> next year. See https://blog.golang.org/generics-proposal for details.
> >>
> >> Go 1.17 https://tip.golang.org/doc/go1.17 does include a Register calling 
> >> convention, leading to a modest performance improvement across the board.
> >>
> >> Cheers,
> >> Robert Burke
> >>
> >> On 2021/06/15 18:10:46, Robert Bradshaw  wrote:
> >> > +1 to declaring Golang support out of experimental once the Go Modules
> >> > issues are solved. I don't think an SDK needs to support every feature
> >> > to be accepted, especially now that we can do cross-language
> >> > transforms, and Go definitely supports enough to be quite useful. (WRT
> >> > streaming, my understanding is that Go supports the streaming model
> >> > with windows and timestamps, and runs fine on a streaming runner, even
> >> > if more advanced features like state and timers aren't yet available.)
> >> >
> >> > This is a great milestone.
> >> >
> >> > On Tue, Jun 15, 2021 at 10:12 AM Tyson Hamilton  
> >> > wrote:
> >> > >
> >> > > WOW! Big news.
> >> > >
> >> > > I'm supportive of leaving experimental status after Go Modules are 
> >> > > completed and the LICENSE issue is resolved. I don't think that 
> >> > > lacking streaming support is a blocker. The other thing I checked to 
> >> > > see was if there were metrics available on metrics.beam.apache.org, 
> >> > > specifically for measuring code health via post-commit over time, 
> >> > > which there are and the passing test rate is high (Huzzah!). The one 
> >> > > thing that surprised me from your summary is that when Go introduces 
> >> > > generics it won't result in any backwards incompatible changes in 
> >> > > Apache Beam. That's great news, but does it mean there will be a need 
> >> > > to support both non-generic and generic APIs moving forward? It seems 
> >> > > like generics will be introduced in the Go 1.17 release 
> >> > > (optimistically) in August this year.
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Jun 10, 2021 at 5:04 PM Robert Burke  
> >> > > wrote:
> >> > >>
> >> > >> Hello Beam Community!
> >> > >>
> >> > >> I propose we stop calling the Apache Beam Go SDK experimental.
> >> > >>
> >> > >> This thread is to discuss it as a community, and any conditions that 
> >> > >> remain that would prevent the exit.
> >> > >>
> >> > >> tl;dr;
> >> > >> Ask Questions for answers and links! I have both.
> >> > >> This entails including it officially in the Release process, removing 
> >> > >> the various "experimental" text throughout the repo etc,
> >> > >> and otherwise treating it like Python and Java. Some Go specific 
> >> > >> tasks around dep versioning.
> >> > >>
> >> > >> The Go SDK implements the beam model efficiently for most batch 
> >> > >> tasks, including basic windowing.
> >> > >> Apache Beam Go jobs can ex

Re: [Proposal] Go SDK Exits Experimental

2021-06-17 Thread Ismaël Mejía
Huge +1

This is definitely something many people have asked about, so it is
great to see it finally happening.

On Wed, Jun 16, 2021 at 7:56 PM Kenneth Knowles  wrote:
>
> +1 awesome
>
> On Wed, Jun 16, 2021 at 10:33 AM Robert Burke  wrote:
>>
>> Sounds reasonable to me. I agree. We'll aim to get those (Go modules and 
>> LICENSE issue) done before the 2.32 cut, and certainly before the 2.33 cut 
>> if release images aren't added to the 2.32 process.
>>
>> Regarding Go Generics: at some point in the future, we may want a harder 
>> break between a newer Generic-first API and the current version, but 
>> there's no rush. Generics/TypeParameters in Go aren't identical to the 
>> feature referred to by that term in Java, C++, Rust, etc, so it'll take a 
>> bit of time for that expertise to develop.
>>
>> However, by the current nature of Go, we had to have pretty sophisticated 
>> reflective analysis to handle DoFns and map them to their graph inputs. So, 
>> adding new helpers like a KV, emitter, and Iterator types, shouldn't be too 
>> difficult. Changing Go SDK internals to use generics (like the 
>> implementation of Stats DoFns like Min, Max, etc) would also be able to be 
>> made transparently to most users, and certainly any of the framework for 
>> execution time handling (the "worker's SDK harness") would be able to be 
>> cleaned up if need be. Finally, adding more sophisticated DoFn registration 
>> and code generation would be able to replace the optional code generator 
>> entirely, saving some users a `go generate` step, simplifying getting 
>> improved execution performance.
>>
>> Changing things like making a Type Parameterized PCollection, would be far 
>> more involved, as would trying to use some kind of Apply format. The lack of 
>> Method Overrides prevents the apply chaining approach. Or at least prevents 
>> it from working simply.
>>
>> Finally, Go Generics won't be available until Go 1.18, which isn't until 
>> next year. See https://blog.golang.org/generics-proposal for details.
>>
>> Go 1.17 https://tip.golang.org/doc/go1.17 does include a Register calling 
>> convention, leading to a modest performance improvement across the board.
>>
>> Cheers,
>> Robert Burke
>>
>> On 2021/06/15 18:10:46, Robert Bradshaw  wrote:
>> > +1 to declaring Golang support out of experimental once the Go Modules
>> > issues are solved. I don't think an SDK needs to support every feature
>> > to be accepted, especially now that we can do cross-language
>> > transforms, and Go definitely supports enough to be quite useful. (WRT
>> > streaming, my understanding is that Go supports the streaming model
>> > with windows and timestamps, and runs fine on a streaming runner, even
>> > if more advanced features like state and timers aren't yet available.)
>> >
>> > This is a great milestone.
>> >
>> > On Tue, Jun 15, 2021 at 10:12 AM Tyson Hamilton  wrote:
>> > >
>> > > WOW! Big news.
>> > >
>> > > I'm supportive of leaving experimental status after Go Modules are 
>> > > completed and the LICENSE issue is resolved. I don't think that lacking 
>> > > streaming support is a blocker. The other thing I checked to see was if 
>> > > there were metrics available on metrics.beam.apache.org, specifically 
>> > > for measuring code health via post-commit over time, which there are and 
>> > > the passing test rate is high (Huzzah!). The one thing that surprised me 
>> > > from your summary is that when Go introduces generics it won't result in 
>> > > any backwards incompatible changes in Apache Beam. That's great news, 
>> > > but does it mean there will be a need to support both non-generic and 
>> > > generic APIs moving forward? It seems like generics will be introduced 
>> > > in the Go 1.17 release (optimistically) in August this year.
>> > >
>> > >
>> > >
>> > > On Thu, Jun 10, 2021 at 5:04 PM Robert Burke  wrote:
>> > >>
>> > >> Hello Beam Community!
>> > >>
>> > >> I propose we stop calling the Apache Beam Go SDK experimental.
>> > >>
>> > >> This thread is to discuss it as a community, and any conditions that 
>> > >> remain that would prevent the exit.
>> > >>
>> > >> tl;dr;
>> > >> Ask Questions for answers and links! I have both.
>> > >> This entails including it officially in the Release process, removing 
>> > >> the various "experimental" text throughout the repo etc,
>> > >> and otherwise treating it like Python and Java. Some Go specific tasks 
>> > >> around dep versioning.
>> > >>
>> > >> The Go SDK implements the beam model efficiently for most batch tasks, 
>> > >> including basic windowing.
>> > >> Apache Beam Go jobs can execute, and are tested on all Portable runners.
>> > >> The core APIs are not going to change in incompatible ways going 
>> > >> forward.
>> > >> Scalable transforms can be written through SplittableDoFns or via Cross 
>> > >> Language transforms.
>> > >>
>> > >> The SDK isn't 100% feature complete, but keeping it experimental 
>> > >> doesn't help with that any further.
>>