Build failed in Jenkins: beam_SeedJob_Standalone #1804

2018-10-23 Thread Apache Jenkins Server
See 


Changes:

[kedin] Fix java-harness build by adding flush() to

--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on beam2 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 7800c3078d8ecaee7d2e789f02b759e579263249 (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7800c3078d8ecaee7d2e789f02b759e579263249
Commit message: "Merge pull request #6807: [BEAM-5833] Fix java-harness build 
by adding flush() to BeamFnDataWriteRunnerTest"
 > git rev-list --no-walk 5e603ad4c642cfba0a6db70abd05ed8e9d89c7d6 # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Processing DSL script job_00_seed.groovy
Processing DSL script job_Dependency_Check.groovy
Processing DSL script job_Inventory.groovy
Processing DSL script job_PerformanceTests_Dataflow.groovy
Processing DSL script job_PerformanceTests_FileBasedIO_IT.groovy
Processing DSL script job_PerformanceTests_FileBasedIO_IT_HDFS.groovy
Processing DSL script job_PerformanceTests_HadoopInputFormat.groovy
Processing DSL script job_PerformanceTests_JDBC.groovy
Processing DSL script job_PerformanceTests_MongoDBIO_IT.groovy
Processing DSL script job_PerformanceTests_Python.groovy
Processing DSL script job_PerformanceTests_Spark.groovy
Processing DSL script job_PostCommit_Go_GradleBuild.groovy
Processing DSL script job_PostCommit_Java_GradleBuild.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Dataflow.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Direct.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Flink.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Spark.groovy
Processing DSL script job_PostCommit_Java_PortableValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Apex.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Dataflow.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Gearpump.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Samza.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Spark.groovy
Processing DSL script job_PostCommit_Python_ValidatesContainer_Dataflow.groovy
Processing DSL script job_PostCommit_Python_ValidatesRunner_Dataflow.groovy
Processing DSL script job_PostCommit_Python_ValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Python_Verify.groovy
Processing DSL script job_PostCommit_Website_Publish.groovy
Processing DSL script job_PostRelease_NightlySnapshot.groovy
Processing DSL script job_PreCommit_CommunityMetrics.groovy
Processing DSL script job_PreCommit_Go.groovy
Processing DSL script job_PreCommit_Java.groovy
Processing DSL script job_PreCommit_Python.groovy
Processing DSL script job_PreCommit_RAT.groovy
ERROR: startup failed:
job_PreCommit_RAT.groovy: 25: unexpected token: } @ line 25, column 1.
   }
   ^

1 error

Not sending mail to unregistered user ke...@google.com


Re: KafkaIO - Deadletter output

2018-10-23 Thread Raghu Angadi
A user can read serialized bytes from KafkaIO and deserialize them explicitly in a
ParDo, which gives complete control over how to handle record errors. This is
what I would do if I needed to in my pipeline.

If there is a transform in Beam that does this, it could be convenient for
users in many such scenarios. This is simpler than each source supporting
it explicitly.
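
For illustration, a minimal sketch of that pattern (a sketch only, assuming
Beam's Java SDK; `Event` and `deserialize()` are placeholders for the user's
own type and parsing logic):

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Read raw bytes so deserialization failures surface in our own ParDo.
PCollection<KafkaRecord<byte[], byte[]>> records =
    pipeline.apply(
        KafkaIO.readBytes()
            .withBootstrapServers(bootstrap)
            .withTopic(topic));

final TupleTag<Event> parsedTag = new TupleTag<Event>() {};
final TupleTag<byte[]> deadLetterTag = new TupleTag<byte[]>() {};

PCollectionTuple results =
    records.apply(
        ParDo.of(new DoFn<KafkaRecord<byte[], byte[]>, Event>() {
          @ProcessElement
          public void process(ProcessContext c) {
            byte[] payload = c.element().getKV().getValue();
            try {
              c.output(deserialize(payload));       // happy path
            } catch (Exception e) {
              c.output(deadLetterTag, payload);     // dead letter output
            }
          }
        }).withOutputTags(parsedTag, TupleTagList.of(deadLetterTag)));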

On Tue, Oct 23, 2018 at 8:03 PM Chamikara Jayalath 
wrote:

> Given that KafkaIO uses the UnboundedSource framework, this is probably not
> something that can easily be supported. We might be able to support similar
> features when we have Kafka on top of Splittable DoFn though. So feel free
> to create a feature request JIRA for this.
>
> Thanks,
> Cham
>
> On Tue, Oct 23, 2018 at 7:43 PM Kenneth Knowles  wrote:
>
>> This is a great question. I've added the dev list to be sure it gets
>> noticed by whoever may know best.
>>
>> Kenn
>>
>> On Tue, Oct 23, 2018 at 2:05 AM Kaymak, Tobias 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>> Is there a way to get a Deadletter Output from a pipeline that uses a
>>> KafkaIO
>>> connector for its input? As `TimestampPolicyFactory.withTimestampFn()`
>>> takes
>>> only a SerializableFunction and not a ParDo, how would I be able to
>>> produce a
>>> Deadletter output from it?
>>>
>>> I have the following pipeline defined that reads from a KafkaIO input:
>>>
>>> pipeline.apply(
>>>     KafkaIO.read()
>>>         .withBootstrapServers(bootstrap)
>>>         .withTopics(topics)
>>>         .withKeyDeserializer(StringDeserializer.class)
>>>         .withValueDeserializer(ConfigurableDeserializer.class)
>>>         .updateConsumerProperties(
>>>             ImmutableMap.of(InputMessagesConfig.CONFIG_PROPERTY_NAME, inputMessagesConfig))
>>>         .updateConsumerProperties(ImmutableMap.of("auto.offset.reset", "earliest"))
>>>         .updateConsumerProperties(ImmutableMap.of("group.id", "beam-consumers"))
>>>         .updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
>>>         .withTimestampPolicyFactory(
>>>             TimestampPolicyFactory.withTimestampFn(
>>>                 new MessageTimestampExtractor(inputMessagesConfig)))
>>>         .withReadCommitted()
>>>         .commitOffsetsInFinalize())
>>>
>>>
>>> and I would like to get deadletter outputs when my timestamp extraction fails.
>>>
>>> Best,
>>> Tobi
>>>
>>>


Build failed in Jenkins: beam_SeedJob #2851

2018-10-23 Thread Apache Jenkins Server
See 

--
GitHub pull request #6802 of commit 4ae0d3d8507e1a71618443f0fcd51547417bd049, 
no merge conflicts.
Setting status of 4ae0d3d8507e1a71618443f0fcd51547417bd049 to PENDING with url 
https://builds.apache.org/job/beam_SeedJob/2851/ and message: 'Build started 
for merge commit.'
Using context: Jenkins: Seed Job
[EnvInject] - Loading node environment variables.
Building remotely on beam12 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/6802/*:refs/remotes/origin/pr/6802/*
 > git rev-parse refs/remotes/origin/pr/6802/merge^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/pr/6802/merge^{commit} # timeout=10
Checking out Revision bfc8641e65737fd2eb0a78997e2dc4fac0eb50bf 
(refs/remotes/origin/pr/6802/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f bfc8641e65737fd2eb0a78997e2dc4fac0eb50bf
Commit message: "Merge 4ae0d3d8507e1a71618443f0fcd51547417bd049 into 
7800c3078d8ecaee7d2e789f02b759e579263249"
First time build. Skipping changelog.
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Processing DSL script job_00_seed.groovy
Processing DSL script job_Dependency_Check.groovy
Processing DSL script job_Inventory.groovy
Processing DSL script job_PerformanceTests_Dataflow.groovy
Processing DSL script job_PerformanceTests_FileBasedIO_IT.groovy
Processing DSL script job_PerformanceTests_FileBasedIO_IT_HDFS.groovy
Processing DSL script job_PerformanceTests_HadoopInputFormat.groovy
Processing DSL script job_PerformanceTests_JDBC.groovy
Processing DSL script job_PerformanceTests_MongoDBIO_IT.groovy
Processing DSL script job_PerformanceTests_Python.groovy
Processing DSL script job_PerformanceTests_Spark.groovy
Processing DSL script job_PostCommit_Go_GradleBuild.groovy
Processing DSL script job_PostCommit_Java_GradleBuild.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Dataflow.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Direct.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Flink.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Spark.groovy
Processing DSL script job_PostCommit_Java_PortableValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Apex.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Dataflow.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Gearpump.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Samza.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Spark.groovy
Processing DSL script job_PostCommit_Python_ValidatesContainer_Dataflow.groovy
Processing DSL script job_PostCommit_Python_ValidatesRunner_Dataflow.groovy
Processing DSL script job_PostCommit_Python_ValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Python_Verify.groovy
Processing DSL script job_PostCommit_Website_Publish.groovy
Processing DSL script job_PostRelease_NightlySnapshot.groovy
Processing DSL script job_PreCommit_CommunityMetrics.groovy
Processing DSL script job_PreCommit_Go.groovy
Processing DSL script job_PreCommit_Java.groovy
Processing DSL script job_PreCommit_Java_Examples_Dataflow.groovy
Processing DSL script job_PreCommit_Python.groovy
Processing DSL script job_PreCommit_RAT.groovy
ERROR: startup failed:
job_PreCommit_RAT.groovy: 25: unexpected token: } @ line 25, column 1.
   }
   ^

1 error

Not sending mail to unregistered user ke...@google.com


Re: KafkaIO - Deadletter output

2018-10-23 Thread Chamikara Jayalath
Given that KafkaIO uses the UnboundedSource framework, this is probably not
something that can easily be supported. We might be able to support similar
features when we have Kafka on top of Splittable DoFn though. So feel free
to create a feature request JIRA for this.

Thanks,
Cham

On Tue, Oct 23, 2018 at 7:43 PM Kenneth Knowles  wrote:

> This is a great question. I've added the dev list to be sure it gets
> noticed by whoever may know best.
>
> Kenn
>
> On Tue, Oct 23, 2018 at 2:05 AM Kaymak, Tobias 
> wrote:
>
>>
>> Hi,
>>
>> Is there a way to get a Deadletter Output from a pipeline that uses a
>> KafkaIO
>> connector for its input? As `TimestampPolicyFactory.withTimestampFn()`
>> takes
>> only a SerializableFunction and not a ParDo, how would I be able to
>> produce a
>> Deadletter output from it?
>>
>> I have the following pipeline defined that reads from a KafkaIO input:
>>
>> pipeline.apply(
>>     KafkaIO.read()
>>         .withBootstrapServers(bootstrap)
>>         .withTopics(topics)
>>         .withKeyDeserializer(StringDeserializer.class)
>>         .withValueDeserializer(ConfigurableDeserializer.class)
>>         .updateConsumerProperties(
>>             ImmutableMap.of(InputMessagesConfig.CONFIG_PROPERTY_NAME, inputMessagesConfig))
>>         .updateConsumerProperties(ImmutableMap.of("auto.offset.reset", "earliest"))
>>         .updateConsumerProperties(ImmutableMap.of("group.id", "beam-consumers"))
>>         .updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
>>         .withTimestampPolicyFactory(
>>             TimestampPolicyFactory.withTimestampFn(
>>                 new MessageTimestampExtractor(inputMessagesConfig)))
>>         .withReadCommitted()
>>         .commitOffsetsInFinalize())
>>
>>
>> and I would like to get deadletter outputs when my timestamp extraction fails.
>>
>> Best,
>> Tobi
>>
>>


Build failed in Jenkins: beam_SeedJob #2850

2018-10-23 Thread Apache Jenkins Server
See 

--
GitHub pull request #6802 of commit 0845ab25a592350c87ec3443d83203c93553d5aa, 
no merge conflicts.
Setting status of 0845ab25a592350c87ec3443d83203c93553d5aa to PENDING with url 
https://builds.apache.org/job/beam_SeedJob/2850/ and message: 'Build started 
for merge commit.'
Using context: Jenkins: Seed Job
[EnvInject] - Loading node environment variables.
Building remotely on beam13 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/6802/*:refs/remotes/origin/pr/6802/*
 > git rev-parse refs/remotes/origin/pr/6802/merge^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/pr/6802/merge^{commit} # timeout=10
Checking out Revision d06d7569266d15529f6aae73a0cc72e353f131aa 
(refs/remotes/origin/pr/6802/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f d06d7569266d15529f6aae73a0cc72e353f131aa
Commit message: "Merge 0845ab25a592350c87ec3443d83203c93553d5aa into 
7800c3078d8ecaee7d2e789f02b759e579263249"
First time build. Skipping changelog.
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Processing DSL script job_00_seed.groovy
Processing DSL script job_Dependency_Check.groovy
Processing DSL script job_Inventory.groovy
Processing DSL script job_PerformanceTests_Dataflow.groovy
Processing DSL script job_PerformanceTests_FileBasedIO_IT.groovy
Processing DSL script job_PerformanceTests_FileBasedIO_IT_HDFS.groovy
Processing DSL script job_PerformanceTests_HadoopInputFormat.groovy
Processing DSL script job_PerformanceTests_JDBC.groovy
Processing DSL script job_PerformanceTests_MongoDBIO_IT.groovy
Processing DSL script job_PerformanceTests_Python.groovy
Processing DSL script job_PerformanceTests_Spark.groovy
Processing DSL script job_PostCommit_Go_GradleBuild.groovy
Processing DSL script job_PostCommit_Java_GradleBuild.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Dataflow.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Direct.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Flink.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Spark.groovy
Processing DSL script job_PostCommit_Java_PortableValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Apex.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Dataflow.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Gearpump.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Samza.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Spark.groovy
Processing DSL script job_PostCommit_Python_ValidatesContainer_Dataflow.groovy
Processing DSL script job_PostCommit_Python_ValidatesRunner_Dataflow.groovy
Processing DSL script job_PostCommit_Python_ValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Python_Verify.groovy
Processing DSL script job_PostCommit_Website_Publish.groovy
Processing DSL script job_PostRelease_NightlySnapshot.groovy
Processing DSL script job_PreCommit_CommunityMetrics.groovy
Processing DSL script job_PreCommit_Go.groovy
Processing DSL script job_PreCommit_Java.groovy
Processing DSL script job_PreCommit_Java_Examples_Dataflow.groovy
ERROR: startup failed:
job_PreCommit_Java_Examples_Dataflow.groovy: 24: expecting ''', found '\n' @ 
line 24, column 50.
   avaExamplesDataflowPreCommit",
 ^

1 error

Not sending mail to unregistered user ke...@google.com


Re: KafkaIO - Deadletter output

2018-10-23 Thread Kenneth Knowles
This is a great question. I've added the dev list to be sure it gets
noticed by whoever may know best.

Kenn

On Tue, Oct 23, 2018 at 2:05 AM Kaymak, Tobias 
wrote:

>
> Hi,
>
> Is there a way to get a Deadletter Output from a pipeline that uses a
> KafkaIO
> connector for its input? As `TimestampPolicyFactory.withTimestampFn()`
> takes
> only a SerializableFunction and not a ParDo, how would I be able to
> produce a
> Deadletter output from it?
>
> I have the following pipeline defined that reads from a KafkaIO input:
>
> pipeline.apply(
>     KafkaIO.read()
>         .withBootstrapServers(bootstrap)
>         .withTopics(topics)
>         .withKeyDeserializer(StringDeserializer.class)
>         .withValueDeserializer(ConfigurableDeserializer.class)
>         .updateConsumerProperties(
>             ImmutableMap.of(InputMessagesConfig.CONFIG_PROPERTY_NAME, inputMessagesConfig))
>         .updateConsumerProperties(ImmutableMap.of("auto.offset.reset", "earliest"))
>         .updateConsumerProperties(ImmutableMap.of("group.id", "beam-consumers"))
>         .updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
>         .withTimestampPolicyFactory(
>             TimestampPolicyFactory.withTimestampFn(
>                 new MessageTimestampExtractor(inputMessagesConfig)))
>         .withReadCommitted()
>         .commitOffsetsInFinalize())
>
>
> and I would like to get deadletter outputs when my timestamp extraction fails.
>
> Best,
> Tobi
>
>


Re: Follow up ideas, to simplify creating MonitoringInfos.

2018-10-23 Thread Kenneth Knowles
FWIW AutoValue will build most of that class for you, if it is as you say.
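
For illustration, a minimal AutoValue sketch of that shape (class and method
names here are hypothetical, not actual Beam code):

import com.google.auto.value.AutoValue;

@AutoValue
abstract class MonitoringInfoSpec {
  abstract String urn();
  abstract String type();

  static Builder builder() {
    // AutoValue generates this implementation at compile time.
    return new AutoValue_MonitoringInfoSpec.Builder();
  }

  @AutoValue.Builder
  abstract static class Builder {
    abstract Builder setUrn(String urn);
    abstract Builder setType(String type);
    abstract MonitoringInfoSpec build();
  }
}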

Kenn

On Tue, Oct 23, 2018 at 6:04 PM Alex Amato  wrote:

> Hi Robert + beam dev list,
>
> I was thinking about your feedback in PR#6205
> , and agree that this
> monitoring_infos.py became a bit big.
>
> I'm working on the Java Implementation of this now, and want to
> incorporate some of these ideas and improve on this.
>
> I think I should make something like a MonitoringInfoBuilder class,
> with a few methods:
>
>- setUrn
>- setTimestamp
>- setting the value (One method for each Type we support
>
> .
>Setting this will also set the type string)
>   - setInt64CounterValue
>   - setDoubleCounterValue
>   - setLatestInt64
>   - setTopNInt64
>   - setMonitoringDataTable
>   - setDistributionInt64
>   - ...
>- setting labels (will set the key and value properly for the label)
>   - setPTransform(value)
>   - setPcollection(value)
>   - ...
>
>
> I think this will make building a metric much easier: you would just call
> the 4 methods and then .build(). These builders are common in Java. (I guess
> there is a similar thing we could do in Python? I'd like to go back and
> refactor that as well when I am done.)
>
> -
>
> As for your other suggestion to define metrics in the proto/enum file
> instead of the yaml file. I am not too sure about the best strategy for
> this. My initial thoughts are:
>
>1. Make a proto extension allowing you to describe/define a
>MonitoringInfo's (the same info as the metric_definitions.yaml
>
> 
>file):
>   1. URN
>   2. Type
>   3. Labels required
>   4. Annotations: Description, Units, etc.
>2. Make the builder read in that MonitoringInfo definition/description
>and assert everything is set properly. I think this would be a decent
>data-driven approach.
>
> I was wondering if you had something else in mind?
>
> Thanks
> Alex
>
>
>


Follow up ideas, to simplify creating MonitoringInfos.

2018-10-23 Thread Alex Amato
Hi Robert + beam dev list,

I was thinking about your feedback in PR#6205
, and agree that this
monitoring_infos.py
<https://github.com/apache/beam/blob/61a9f7193f1a61869915da3b4f386b34eac63822/sdks/python/apache_beam/metrics/monitoring_infos.py>
became a bit big.

I'm working on the Java Implementation of this now, and want to incorporate
some of these ideas and improve on this.

I think I should make something like a MonitoringInfoBuilder class,
with a few methods:

   - setUrn
   - setTimestamp
   - setting the value (One method for each Type we support
   
.
   Setting this will also set the type string)
  - setInt64CounterValue
  - setDoubleCounterValue
  - setLatestInt64
  - setTopNInt64
  - setMonitoringDataTable
  - setDistributionInt64
  - ...
   - setting labels (will set the key and value properly for the label)
  - setPTransform(value)
  - setPcollection(value)
  - ...


I think this will make building a metric much easier: you would just call
the 4 methods and then .build(). These builders are common in Java. (I guess
there is a similar thing we could do in Python? I'd like to go back and
refactor that as well when I am done.)
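
For illustration, a minimal sketch of what such a builder's surface could look
like (everything here is hypothetical -- the names, type string, and label key
are assumptions, not the actual Beam classes):

import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the MonitoringInfo proto message.
class MonitoringInfo {
  final String urn;
  final String type;
  final long int64Value;
  final Map<String, String> labels;

  MonitoringInfo(String urn, String type, long int64Value,
      Map<String, String> labels) {
    this.urn = urn;
    this.type = type;
    this.int64Value = int64Value;
    this.labels = labels;
  }
}

class MonitoringInfoBuilder {
  private String urn;
  private String type;
  private long int64Value;
  private final Map<String, String> labels = new HashMap<>();

  MonitoringInfoBuilder setUrn(String urn) {
    this.urn = urn;
    return this;
  }

  // Setting the value also sets the matching type string, so the two
  // can never disagree (the type string here is an assumption).
  MonitoringInfoBuilder setInt64CounterValue(long value) {
    this.int64Value = value;
    this.type = "beam:metrics:sum_int_64";
    return this;
  }

  // Label setters know the well-known key; callers supply only the value.
  MonitoringInfoBuilder setPTransform(String value) {
    labels.put("PTRANSFORM", value);
    return this;
  }

  MonitoringInfo build() {
    return new MonitoringInfo(urn, type, int64Value, labels);
  }
}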

-

As for your other suggestion to define metrics in the proto/enum file
instead of the yaml file. I am not too sure about the best strategy for
this. My initial thoughts are:

   1. Make a proto extension allowing you to describe/define a
   MonitoringInfo's (the same info as the metric_definitions.yaml
   

   file):
  1. URN
  2. Type
  3. Labels required
  4. Annotations: Description, Units, etc.
   2. Make the builder read in that MonitoringInfo definition/description
   and assert everything is set properly. I think this would be a decent
   data-driven approach.

I was wondering if you had something else in mind?

Thanks
Alex


Build failed in Jenkins: beam_SeedJob #2849

2018-10-23 Thread Apache Jenkins Server
See 


Changes:

[kedin] [SQL] Move builtin aggregations creation to a map of factories

[kedin] [SQL] Simplify AggregationRel

[kedin] [SQL] Add AggregationCall wrapper

[kedin] [SQL] Inline aggregation rel helper transforms

[kedin] [SQL] Move CombineFn creation to AggregationCall constructor

[kedin] [SQL] Split and rename Aggregation CombineFn wrappers

[kedin] [SQL] Make AggregationCombineFnAdapter non-AutoValue

[kedin] [SQL] Convert ifs to guard statements in AggregationCombineFnAdapter

[kedin] [SQL] Convert Covariance to accept rows instead of KVs

[kedin] [SQL] Split Args Adapters from AggregationCombineFnAdapter

[kedin] [SQL] Extract MultipleAggregationFn from BeamAggregationTransforms

[kedin] [SQL] Clean up, comment aggregation transforms

[kenn] [BEAM-5833] Fix checkstyle breakage

[scott] [BEAM-5837] Add initial jenkins job to verify community metrics infra.

[scott] Create separate :rat precommit and remove it from others.

--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on beam13 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 5e603ad4c642cfba0a6db70abd05ed8e9d89c7d6 (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5e603ad4c642cfba0a6db70abd05ed8e9d89c7d6
Commit message: "Merge pull request #6805:  Create separate :rat precommit and 
remove it from others"
 > git rev-list --no-walk 406a71a640e45608f0b21cee4b61ddb1201f7e23 # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Processing DSL script job_00_seed.groovy
Processing DSL script job_Dependency_Check.groovy
Processing DSL script job_Inventory.groovy
Processing DSL script job_PerformanceTests_Dataflow.groovy
Processing DSL script job_PerformanceTests_FileBasedIO_IT.groovy
Processing DSL script job_PerformanceTests_FileBasedIO_IT_HDFS.groovy
Processing DSL script job_PerformanceTests_HadoopInputFormat.groovy
Processing DSL script job_PerformanceTests_JDBC.groovy
Processing DSL script job_PerformanceTests_MongoDBIO_IT.groovy
Processing DSL script job_PerformanceTests_Python.groovy
Processing DSL script job_PerformanceTests_Spark.groovy
Processing DSL script job_PostCommit_Go_GradleBuild.groovy
Processing DSL script job_PostCommit_Java_GradleBuild.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Dataflow.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Direct.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Flink.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Spark.groovy
Processing DSL script job_PostCommit_Java_PortableValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Apex.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Dataflow.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Gearpump.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Samza.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Spark.groovy
Processing DSL script job_PostCommit_Python_ValidatesContainer_Dataflow.groovy
Processing DSL script job_PostCommit_Python_ValidatesRunner_Dataflow.groovy
Processing DSL script job_PostCommit_Python_ValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Python_Verify.groovy
Processing DSL script job_PostCommit_Website_Publish.groovy
Processing DSL script job_PostRelease_NightlySnapshot.groovy
Processing DSL script job_PreCommit_CommunityMetrics.groovy
Processing DSL script job_PreCommit_Go.groovy
Processing DSL script job_PreCommit_Java.groovy
Processing DSL script job_PreCommit_Python.groovy
Processing DSL script job_PreCommit_RAT.groovy
ERROR: startup failed:
job_PreCommit_RAT.groovy: 25: unexpected token: } @ line 25, column 1.
   }
   ^

1 error

Not sending mail to unregistered user ke...@google.com


Build failed in Jenkins: beam_SeedJob_Standalone #1803

2018-10-23 Thread Apache Jenkins Server
See 


Changes:

[kedin] [SQL] Move builtin aggregations creation to a map of factories

[kedin] [SQL] Simplify AggregationRel

[kedin] [SQL] Add AggregationCall wrapper

[kedin] [SQL] Inline aggregation rel helper transforms

[kedin] [SQL] Move CombineFn creation to AggregationCall constructor

[kedin] [SQL] Split and rename Aggregation CombineFn wrappers

[kedin] [SQL] Make AggregationCombineFnAdapter non-AutoValue

[kedin] [SQL] Convert ifs to guard statements in AggregationCombineFnAdapter

[kedin] [SQL] Convert Covariance to accept rows instead of KVs

[kedin] [SQL] Split Args Adapters from AggregationCombineFnAdapter

[kedin] [SQL] Extract MultipleAggregationFn from BeamAggregationTransforms

[kedin] [SQL] Clean up, comment aggregation transforms

[kenn] [BEAM-5833] Fix checkstyle breakage

[scott] [BEAM-5837] Add initial jenkins job to verify community metrics infra.

[scott] Create separate :rat precommit and remove it from others.

--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on beam10 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 5e603ad4c642cfba0a6db70abd05ed8e9d89c7d6 (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5e603ad4c642cfba0a6db70abd05ed8e9d89c7d6
Commit message: "Merge pull request #6805:  Create separate :rat precommit and 
remove it from others"
 > git rev-list --no-walk 406a71a640e45608f0b21cee4b61ddb1201f7e23 # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Processing DSL script job_00_seed.groovy
Processing DSL script job_Dependency_Check.groovy
Processing DSL script job_Inventory.groovy
Processing DSL script job_PerformanceTests_Dataflow.groovy
Processing DSL script job_PerformanceTests_FileBasedIO_IT.groovy
Processing DSL script job_PerformanceTests_FileBasedIO_IT_HDFS.groovy
Processing DSL script job_PerformanceTests_HadoopInputFormat.groovy
Processing DSL script job_PerformanceTests_JDBC.groovy
Processing DSL script job_PerformanceTests_MongoDBIO_IT.groovy
Processing DSL script job_PerformanceTests_Python.groovy
Processing DSL script job_PerformanceTests_Spark.groovy
Processing DSL script job_PostCommit_Go_GradleBuild.groovy
Processing DSL script job_PostCommit_Java_GradleBuild.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Dataflow.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Direct.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Flink.groovy
Processing DSL script job_PostCommit_Java_Nexmark_Spark.groovy
Processing DSL script job_PostCommit_Java_PortableValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Apex.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Dataflow.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Gearpump.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Samza.groovy
Processing DSL script job_PostCommit_Java_ValidatesRunner_Spark.groovy
Processing DSL script job_PostCommit_Python_ValidatesContainer_Dataflow.groovy
Processing DSL script job_PostCommit_Python_ValidatesRunner_Dataflow.groovy
Processing DSL script job_PostCommit_Python_ValidatesRunner_Flink.groovy
Processing DSL script job_PostCommit_Python_Verify.groovy
Processing DSL script job_PostCommit_Website_Publish.groovy
Processing DSL script job_PostRelease_NightlySnapshot.groovy
Processing DSL script job_PreCommit_CommunityMetrics.groovy
Processing DSL script job_PreCommit_Go.groovy
Processing DSL script job_PreCommit_Java.groovy
Processing DSL script job_PreCommit_Python.groovy
Processing DSL script job_PreCommit_RAT.groovy
ERROR: startup failed:
job_PreCommit_RAT.groovy: 25: unexpected token: } @ line 25, column 1.
   }
   ^

1 error

Not sending mail to unregistered user ke...@google.com


[SQL] Investigation of missing/wrong session_end implementation in BeamSQL

2018-10-23 Thread Rui Wang
Hi community,

In BeamSQL, SESSION window is supported in GROUP BY. Example query:

"SELECT f_int2, COUNT(*) AS `getFieldCount`,"
+ " SESSION_START(f_timestamp, INTERVAL '5' MINUTE) AS `window_start`, "
+ " SESSION_END(f_timestamp, INTERVAL '5' MINUTE) AS `window_end` "
+ " FROM TABLE_A"
+ " GROUP BY f_int2, SESSION(f_timestamp, INTERVAL '5' MINUTE)";


However, I observed that SESSION_END (window_end) always returns the same
timestamp as SESSION_START (window_start), so BeamSQL is missing the
implementation of SESSION_END. Here is a summary of the root-cause
investigation and the proposed fix:

*Why are we not missing tumble_end and hop_end?*
Because when generating the logical plan, Calcite replaces tumble_start and
hop_start with a reference to GROUP BY's TUMBLE/HOP. The GROUP BY's
TUMBLE/HOP is supposed to return a timestamp. Calcite then replaces
tumble_end and hop_end with PLUS(timestamp reference, window_size as a
constant). As tumble and hop have fixed window sizes as constants in their
function signatures, Calcite generates the PLUS in the logical plan, which
means that for tumble and hop we only need a timestamp (which represents
window_start in our implementation) to generate both window_start and
window_end in Projection.

We are emitting window_start timestamp as the result of TUMBLE/HOP/SESSION
functions:
https://github.com/amaliujia/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/transform/BeamAggregationTransforms.java#L84



*Why are we missing session_end?*
Because Calcite does not know the window size of a session window, so in the
logical plan Calcite generates a reference to GROUP BY's SESSION for
session_end, the same as the reference generated for session_start. So in the
logical plan, session_start = session_end. Because BeamSQL does not
differentiate session from tumble and hop, we return window_start as the
result of the SESSION function, and in the final result we see
session_start = session_end.

*Is this a Calcite bug?*
Yes and No.

Clearly Calcite shouldn't hide window_end by creating a wrong reference in
the logical plan. If Calcite does not know what session_end is, it should at
least keep it. Ideally Calcite would keep window_end in the logical plan and
let us decide what it means: either a reference, a PLUS, or something else.

However, Calcite leaves space for us to add the window_end back in physical
plan nodes. For example, we can add window_end back in BeamAggregationRel.
We can probably change the reference of session_end to a reference to our
window_end in BeamAggregationRel.

*What is the fix?*
In BeamAggregationRel, we should add a window_end field right after the
window functions, and emit the window_end timestamp for that added field. And
in Projection, we should change window_end from a PLUS (for tumble and hop)
or a wrong reference (for session) to a correct reference to the newly added
window_end in Aggregation.

Jira: https://issues.apache.org/jira/browse/BEAM-5843


-Rui


Re: [DISCUSS] Publish vendored dependencies independently

2018-10-23 Thread Kenneth Knowles
I think it makes sense for each vendored dependency to be self-contained as
much as possible. That should keep things fairly simple. Things that cross
their API surface cannot be hidden, of course. Jar size is not a concern IMO.

Kenn

On Tue, Oct 23, 2018 at 9:05 AM Lukasz Cwik  wrote:

> How should we handle the transitive dependencies of the things we want to
> vendor?
>
> For example we use gRPC which depends on Guava 20 and we also use Calcite
> which depends on Guava 19.
>
> Should the vendored gRPC/Calcite/... be self-contained so it contains all
> its dependencies, hence vendored gRPC would contain Guava 20 and vendored
> Calcite would contain Guava 19 (both under different namespaces)?
> This leads to larger jars but fewer vendored dependencies to maintain.
>
> Or should we produce a vendored library for those that we want to share,
> e.g. Guava 20 that could be reused across multiple vendored libraries?
> Makes the vendoring process slightly more complicated, more dependencies
> to maintain, smaller jars.
>
> Or should we produce a vendored library for each dependency?
> Lots of vendoring needed, likely tooling required to be built to maintain
> this.
>
>
>
>
> On Tue, Oct 23, 2018 at 8:46 AM Kenneth Knowles  wrote:
>
>> I actually created the subtasks by finding things shaded by at least one
>> module. I think each one should definitely have an on-list discussion that
>> clarifies the target artifact, namespace, version, possible complications,
>> etc.
>>
>> My impression is that many many modules shade only Guava. So for build
>> time and simplification that is a big win.
>>
>> Kenn
>>
>> On Tue, Oct 23, 2018, 08:16 Thomas Weise  wrote:
>>
>>> +1 for separate artifacts
>>>
>>> I would request that we explicitly discuss and agree which dependencies
>>> we vendor though.
>>>
>>> Not everything listed in the JIRA subtasks is currently relocated.
>>>
>>> Thomas
>>>
>>>
>>> On Tue, Oct 23, 2018 at 8:04 AM David Morávek 
>>> wrote:
>>>
 +1 This should improve build times a lot. It would be great if vendored
 deps could stay in the main repository.

 D.

 On Tue, Oct 23, 2018 at 12:21 PM Maximilian Michels 
 wrote:

> Looks great, Kenn!
>
> > Max: what is the story behind having a separate flink-shaded repo?
> Did that make it easier to manage in some way?
>
> Better separation of concerns, but I don't think releasing the shaded
> artifacts from the main repo is a problem. I'd even prefer not to
> split
> up the repo because it makes updates to the vendored dependencies
> slightly easier.
>
> On 23.10.18 03:25, Kenneth Knowles wrote:
> > OK, I've filed https://issues.apache.org/jira/browse/BEAM-5819 to
> > collect sub-tasks. This has enough upsides throughout lots of areas
> of
> > the project that even though it is not glamorous it seems pretty
> > valuable to start on immediately. And I want to find out if there's
> a
> > pitfall lurking.
> >
> > Max: what is the story behind having a separate flink-shaded repo?
> Did
> > that make it easier to manage in some way?
> >
> > Kenn
> >
> > On Mon, Oct 22, 2018 at 2:55 AM Maximilian Michels  > > wrote:
> >
> > +1 for publishing vendored Jars independently. It will improve
> build
> > time and ease IntelliJ integration.
> >
> > Flink also publishes shaded dependencies separately:
> >
> > - https://github.com/apache/flink-shaded
> > - https://issues.apache.org/jira/browse/FLINK-6529
> >
> > AFAIK their main motivation was to get rid of duplicate shaded
> classes
> > on the classpath. We don't appear to have that problem because we
> > already have a separate "vendor" project.
> >
> >  >  - With shading, it is hard (impossible?) to step into
> dependency
> > code in IntelliJ's debugger, because the actual symbol at runtime
> > does not match what is in the external jars
> >
> > This would be solved by releasing the sources of the shaded jars.
> >  From a
> > legal perspective, this could be problematic as alluded to here:
> > https://github.com/apache/flink-shaded/issues/25
> >
> > -Max
> >
> > On 20.10.18 01:11, Lukasz Cwik wrote:
> >  > I have tried several times to improve the build system and
> intellij
> >  > integration and each attempt ended with little progress when
> dealing
> >  > with vendored code. My latest attempt has been the most
> promising
> > where
> >  > I take the vendored classes/jars and decompile them
> generating the
> >  > source that Intellij can then use. I have a branch[1] that
> > demonstrates
> >  > the idea. It works pretty well (and up until a change where we
> > started
> >  > vendoring gRPC, 

Re: What is required for LTS releases? (was: [PROPOSAL] Prepare Beam 2.8.0 release)

2018-10-23 Thread Kenneth Knowles
Yes, user@ cannot reach new users, really. Twitter might, if we have enough
adjacent followers to get it in front of the right people. On the other
hand, I find testimonials from experience convincing in this case.

Kenn

On Tue, Oct 23, 2018 at 2:59 PM Ahmet Altay  wrote:

>
>
> On Tue, Oct 23, 2018 at 9:16 AM, Thomas Weise  wrote:
>
>>
>>
>> On Mon, Oct 22, 2018 at 2:42 PM Ahmet Altay  wrote:
>>
>>> We attempted to collect feedback on the mailing lists but did not get
>>> much input. From my experience (mostly based on dataflow) there is a
>>> sizeable group of users who are less interested in new features and want a
>>> version that is stable, that does not have security issues, major data
>>> integrity issues etc. In Beam's existing release model that corresponds to
>>> the latest release.
>>>
>>> It would help a lot if we can hear the perspectives of other users who
>>> are not present here through the developers who work with them.
>>>
>>
>> Perhaps user@ and Twitter are good ways to reach relevant audience.
>>
>
> We tried user@ before but did not get any feedback [1]. Polling on Twitter
> sounds like a good idea. Unless there is an objection, I can start a poll
> with Thomas's proposed text as is on Beam's twitter account.
>
> [1]
> https://lists.apache.org/thread.html/7d890d6ed221c722a95d9c773583450767b79ee0c0c78f48a56c7eba@%3Cuser.beam.apache.org%3E
>
>
>>
>> A poll could look like this:
>>
>> The Beam community is considering LTS (Long Term Support) for selected
>> releases. LTS releases would only contain critical bug fixes (security,
>> data integrity etc.) and offer an alternative to upgrading to latest Beam
>> release with new features. Please indicate your preference for Beam
>> upgrades:
>>
>> 1) Always upgrading to the latest release because I need latest features
>> along with bug fixes
>> 2) Interested to switch to LTS releases to obtain critical fixes
>> 3) Not upgrading (using older release for other reasons)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>


Re: What is required for LTS releases? (was: [PROPOSAL] Prepare Beam 2.8.0 release)

2018-10-23 Thread Ahmet Altay
On Tue, Oct 23, 2018 at 9:16 AM, Thomas Weise  wrote:

>
>
> On Mon, Oct 22, 2018 at 2:42 PM Ahmet Altay  wrote:
>
>> We attempted to collect feedback on the mailing lists but did not get
>> much input. From my experience (mostly based on dataflow) there is a
>> sizeable group of users who are less interested in new features and want a
>> version that is stable, that does not have security issues, major data
>> integrity issues etc. In Beam's existing release model that corresponds to
>> the latest release.
>>
>> It would help a lot if we can hear the perspectives of other users who
>> are not present here through the developers who work with them.
>>
>
> Perhaps user@ and Twitter are good ways to reach relevant audience.
>

We tried user@ before but did not get any feedback [1]. Polling on Twitter
sounds like a good idea. Unless there is an objection, I can start a poll
with Thomas's proposed text as is on Beam's twitter account.

[1]
https://lists.apache.org/thread.html/7d890d6ed221c722a95d9c773583450767b79ee0c0c78f48a56c7eba@%3Cuser.beam.apache.org%3E


>
> A poll could look like this:
>
> The Beam community is considering LTS (Long Term Support) for selected
> releases. LTS releases would only contain critical bug fixes (security,
> data integrity etc.) and offer an alternative to upgrading to latest Beam
> release with new features. Please indicate your preference for Beam
> upgrades:
>
> 1) Always upgrading to the latest release because I need latest features
> along with bug fixes
> 2) Interested to switch to LTS releases to obtain critical fixes
> 3) Not upgrading (using older release for other reasons)
>
>
>
>
>
>
>
>
>
>
>


Re: Java Precommit duration

2018-10-23 Thread Robert Bradshaw
On Tue, Oct 23, 2018 at 11:28 PM Kenneth Knowles  wrote:

> Hi all,
>
> Java Precommit duration is about 1h15. That is quite a burden. Especially
> if something gets broken.
>

I'm in favor of fixes for (simple!) build breaks going in before precommits
finish, on the promise that the offending test(s) passed locally. Short of
that, we can roll back.

If it were cheap to get a fast "this is probably good" signal, that could
be useful as well, though once you hit the "I'm waiting long enough to go
do something else" threshold, the difference between 20 minutes and 80
minutes is not that huge.


> We turned off parallel builds, which we really need to re-enable.
>

+1


> But beyond that, I see low-hanging fruit that would most appropriately be
> a separate Jenkins job.
>
> Here's a scan of a successful run:
> https://scans.gradle.com/s/2s4bd5hc45wuy/timeline
>
> * 17m :beam-runners-google-cloud-dataflow-java-examples:preCommit
> * 4m :beam-runners-google-cloud-dataflow-java-examples-streaming:preCommit
> These are integration tests that should have their own job & status
> anyhow. We lumped them in because Maven can't do separate tests. Gradle
> makes this cheap and easy.
>
> Then there are these which are the only other tasks over 1m:
>
> * 2m :beam-runners-google-cloud-dataflow-java-legacy-worker:test
> * 2m :beam-runners-google-cloud-dataflow-java-fn-api-worker:test
> * 2m :beam-sdks-java-nexmark:test
> * 1m :beam-sdks-java-io-google-cloud-platform:test
> * 1m :beam-sdks-java-io-hbase:test
> * 1m :beam-sdks-java-extensions-sql:test
>
> Maybe not worth messing with these.  Also if we remove all the shadowJar
> and shadowTestJar tasks it actually looks like it would only save 5
> minutes, so I was hasty in thinking that would solve things. It will make
> interactive work better (going from 30s to maybe <10s for rebuilds) but
> won't help that much for Jenkins.
>
> Kenn
>


Please ignore the 'Java FnApi PreCommit' and 'Java FnApi PostCommit' failures

2018-10-23 Thread Boyuan Zhang
Hey all,

I'm working on adding two more Jenkins jobs to run the Java PreCommit and
PostCommit with the fn-api worker, and on stabilizing their status. Please
ignore failures from these two jobs. Once they are ready, there will be a
follow-up email. Sorry for the inconvenience!

Best,
Boyuan Zhang


Java Precommit duration

2018-10-23 Thread Kenneth Knowles
Hi all,

Java Precommit duration is about 1h15. That is quite a burden. Especially
if something gets broken. We turned off parallel builds, which we really
need to re-enable. But beyond that, I see low-hanging fruit that would most
appropriately be a separate Jenkins job.

Here's a scan of a successful run:
https://scans.gradle.com/s/2s4bd5hc45wuy/timeline

* 17m :beam-runners-google-cloud-dataflow-java-examples:preCommit
* 4m :beam-runners-google-cloud-dataflow-java-examples-streaming:preCommit
These are integration tests that should have their own job & status anyhow.
We lumped them in because Maven can't do separate tests. Gradle makes this
cheap and easy.

Then there are these which are the only other tasks over 1m:

* 2m :beam-runners-google-cloud-dataflow-java-legacy-worker:test
* 2m :beam-runners-google-cloud-dataflow-java-fn-api-worker:test
* 2m :beam-sdks-java-nexmark:test
* 1m :beam-sdks-java-io-google-cloud-platform:test
* 1m :beam-sdks-java-io-hbase:test
* 1m :beam-sdks-java-extensions-sql:test

Maybe not worth messing with these.  Also if we remove all the shadowJar
and shadowTestJar tasks it actually looks like it would only save 5
minutes, so I was hasty in thinking that would solve things. It will make
interactive work better (going from 30s to maybe <10s for rebuilds) but
won't help that much for Jenkins.

Kenn


Re: [Proposal] Add exception handling option to MapElements

2018-10-23 Thread Jeff Klukas
https://github.com/apache/beam/pull/6586 is still open for review, but I
also wanted to gather feedback about a potential refactor as part of that
change.

We could refactor MapElements, FlatMapElements, and Filter to all inherit
from a common abstract base class SingleMessageTransform. The new code for
exception handling is nearly identical between the three classes and could
be consolidated without altering the current public interfaces. Are there
concerns with adding such a base class?
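
For context, a minimal sketch of the hand-rolled try/catch dead-letter pattern
this work aims to replace (a sketch only; the parse logic is a placeholder):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

final TupleTag<Integer> successTag = new TupleTag<Integer>() {};
final TupleTag<KV<String, String>> failureTag = new TupleTag<KV<String, String>>() {};

PCollectionTuple results =
    input.apply(
        ParDo.of(new DoFn<String, Integer>() {
          @ProcessElement
          public void process(ProcessContext c) {
            try {
              c.output(Integer.parseInt(c.element()));
            } catch (NumberFormatException e) {
              // Route the failing element plus the error to a side output.
              c.output(failureTag, KV.of(c.element(), e.getMessage()));
            }
          }
        }).withOutputTags(successTag, TupleTagList.of(failureTag)));

PCollection<Integer> parsed = results.get(successTag);
PCollection<KV<String, String>> failures = results.get(failureTag);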

On Thu, Oct 11, 2018 at 4:44 PM Jeff Klukas  wrote:

> The PR (https://github.com/apache/beam/pull/6586) is updated now with a
> coding solution for Failure. We use AvroCoder for the Exception and inherit
> whatever the input coder was for values.
>
> The unfortunate bit is that users might provide an Exception subclass that
> doesn't provide a no-argument constructor and thus isn't
> AvroCoder-compatible. I'm currently handling this through early failure
> with context about how to choose a different exception type.
>
>
> On Fri, Oct 5, 2018 at 3:59 PM Jeff Klukas  wrote:
>
>> It would be ideal to have some higher-level way of wrapping a PTransform
>> to handle errors inside, but that indeed seems like a substantially
>> trickier thing to implement.
>>
>>
>>
>>
>>
>> On Fri, Oct 5, 2018 at 3:38 PM Reuven Lax  wrote:
>>
>>> Cool! I've left a few comments.
>>>
>>> This also makes me think whether we can implement this on ParDo as well,
>>> though that might be a bit trickier since it involves hooking into
>>> DoFnInvoker.
>>>
>>> Reuven
>>>
>>> On Fri, Oct 5, 2018 at 10:33 AM Jeff Klukas  wrote:
>>>
 I've posted a full PR for the Java exception handling API that's ready
 for review: https://github.com/apache/beam/pull/6586

 It implements new WithErrors nested classes on MapElements,
 FlatMapElements, Filter, AsJsons, and ParseJsons.

 On Wed, Oct 3, 2018 at 7:55 PM Jeff Klukas  wrote:

> Jira issues for adding exception handling in Java and Python SDKs:
>
> https://issues.apache.org/jira/browse/BEAM-5638
> https://issues.apache.org/jira/browse/BEAM-5639
>
> I'll plan to have a complete PR for the Java SDK put together in the
> next few days.
>
> On Wed, Oct 3, 2018 at 1:29 PM Jeff Klukas 
> wrote:
>
>> I don't personally have experience with the Python SDK, so am not
>> immediately in a position to comment on how feasible it would be to
>> introduce a similar change there. I'll plan to write up two separate 
>> issues
>> for adding exception handling in the Java and Python SDKs.
>>
>> On Wed, Oct 3, 2018 at 12:17 PM Thomas Weise  wrote:
>>
>>> +1 for the proposal as well as the suggestion to offer it in other
>>> SDKs, where applicable
>>>
>>> On Wed, Oct 3, 2018 at 8:58 AM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
 Sounds like a very good addition. I'd say this can be a single PR
 since changes are related. Please open a JIRA for tracking.

 Have you thought about introducing a similar change to the Python SDK?
 (doesn't have to be the same PR).

 - Cham

 On Wed, Oct 3, 2018 at 8:31 AM Jeff Klukas 
 wrote:

> If this looks good for MapElements, I agree that it makes sense to
> extend to FlatMapElements and Filter and to keep the API consistent 
> between
> them.
>
> Do you have suggestions on how to submit changes with that wider
> scope? Would one PR altering MapElements, FlatMapElements, Filter,
> ParseJsons, and AsJsons be too large to reasonably review? Should I 
> open an
> overall JIRA ticket to track and break this into smaller  PRs?
>
> On Wed, Oct 3, 2018 at 10:31 AM Reuven Lax 
> wrote:
>
>> Sounds cool. Why not support this on other transforms as well?
>> (FlatMapElements, Filter, etc.)
>>
>> Reuven
>>
>> On Tue, Oct 2, 2018 at 4:49 PM Jeff Klukas 
>> wrote:
>>
>>> I've seen a few Beam users mention the need to handle errors in
>>> their transforms by using a try/catch and routing to different 
>>> outputs
>>> based on whether an exception was thrown. This was particularly 
>>> nicely
>>> written up in a post by Vallery Lancey:
>>>
>>>
>>> https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a
>>>
>>> I'd love to see this pattern better supported directly in the
>>> Beam API, because it currently requires the user to implement a 
>>> full DoFn
>>> even for the simplest cases.
>>>
>>> I propose we support for a MapElements-like transform that
>>> allows the user to specify a set of exceptions to catch and route 
>>> to a

Re: Possible memory leak in Direct Runner unbounded

2018-10-23 Thread Andrew Pilloud
Hi Martin,

I've seen similar things. The Direct Runner is intended for testing with
small datasets, and is expected to retain the entire dataset in memory. It
sounds like you have a pipeline that requires storing data for a GroupByKey
operation. There is no mechanism to page intermediates to disk in the
Direct Runner.

You might want to try the Flink local runner, which should handle this case
better.
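
(As an aside, a minimal sketch of switching runners -- assuming the
beam-runners-flink artifact is on the classpath:)

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// With no Flink master configured, FlinkRunner executes the pipeline on a
// local embedded Flink cluster rather than the Direct Runner's in-memory
// evaluator.
FlinkPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
options.setRunner(FlinkRunner.class);
Pipeline pipeline = Pipeline.create(options);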

Andrew

On Sun, Oct 21, 2018 at 3:43 PM Martin Procházka 
wrote:

> Hello,
> I have an application which utilizes a Beam pipeline with the Direct Runner.
> It contains an unbounded source. I have a frontend which manually adds
> some data into the pipeline with the same timestamp, in order for it to be
> processed in the same window.
>
> The pipeline runs well; however, it eventually runs out of heap space. I
> have profiled the application and have noticed that there is a hotspot in
> outputWatermark - holds - keyedHolds. It gets swamped mainly by values
> keyed by the anonymous StructuralKey 'empty' classes over time. With every
> request it grows and never gets released.
>
> When I changed the empty structural key to a true singleton, it solved a
> part of this issue, but I have noticed that there is a specific test that
> ensures that two empty keys (StructuralKey) are not equal so my change
> would not be valid. When are those empty keys used and when should they be
> removed in the Direct runner? Is there some mechanism to prevent the
> inevitable heap out-of-memory error after a few requests?
>
> Regards,
> Martin Prochazka
>
>
>
>
>


Re: Data Preprocessing in Beam

2018-10-23 Thread Lukasz Cwik
Arnoud Fournier (afourn...@talend.com) started by adding a library to
support sketching (
https://github.com/apache/beam/tree/master/sdks/java/extensions/sketching);
I feel as though some of these could be added there, or possibly within
another extension.

On Tue, Oct 23, 2018 at 9:54 AM Austin Bennett 
wrote:

> Hi Beam Devs,
>
> Alejandro, copied, is an enthusiastic developer, who recently coded up:
> https://github.com/elbaulp/DPASF (associated paper found:
> https://arxiv.org/abs/1810.06021).
>
> He had been looking to contribute that code to FlinkML, at which point I
> found him and alerted him to Beam.  He has been learning a bit on Beam
> recently.  Would this data preprocessing be a welcome contribution to the
> project?  If so, perhaps others better versed in internals (I'm not there
> yet -- though could follow along!) would be willing to provide feedback to
> shape this into a suitable Beam contribution.
>
> Cheers,
> Austin
>
>
>


Data Preprocessing in Beam

2018-10-23 Thread Austin Bennett
Hi Beam Devs,

Alejandro, copied, is an enthusiastic developer, who recently coded up:
https://github.com/elbaulp/DPASF (associated paper found:
https://arxiv.org/abs/1810.06021).

He had been looking to contribute that code to FlinkML, at which point I
found him and alerted him to Beam.  He has been learning a bit on Beam
recently.  Would this data preprocessing be a welcome contribution to the
project?  If so, perhaps others better versed in internals (I'm not there
yet -- though could follow along!) would be willing to provide feedback to
shape this into a suitable Beam contribution.

Cheers,
Austin


Re: [DISCUSS] Publish vendored dependencies independently

2018-10-23 Thread Lukasz Cwik
How should we handle the transitive dependencies of the things we want to
vendor?

For example we use gRPC which depends on Guava 20 and we also use Calcite
which depends on Guava 19.

Should the vendored gRPC/Calcite/... be self-contained so it contains all
its dependencies, hence vendored gRPC would contain Guava 20 and vendored
Calcite would contain Guava 19 (both under different namespaces)?
This leads to larger jars but fewer vendored dependencies to maintain.

Or should we produce a vendored library for those that we want to share,
e.g. Guava 20 that could be reused across multiple vendored libraries?
Makes the vendoring process slightly more complicated, more dependencies to
maintain, smaller jars.

Or should we produce a vendored library for each dependency?
Lots of vendoring needed, likely tooling required to be built to maintain
this.




On Tue, Oct 23, 2018 at 8:46 AM Kenneth Knowles  wrote:

> I actually created the subtasks by finding things shaded by at least one
> module. I think each one should definitely have an on-list discussion that
> clarifies the target artifact, namespace, version, possible complications,
> etc.
>
> My impression is that many many modules shade only Guava. So for build
> time and simplification that is a big win.
>
> Kenn
>
> [earlier quoted messages snipped]

Re: Docker missing on Beam15

2018-10-23 Thread Thomas Weise
Thanks! There have been a few successful runs now.

On Tue, Oct 23, 2018 at 8:52 AM Yifan Zou  wrote:

> FYI, the docker was restarted on beam15.
>
> On Tue, Oct 23, 2018 at 7:08 AM Thomas Weise  wrote:
>
>> For the latter (createProcessWorker):
>> https://github.com/apache/beam/pull/6793
>>
>>
>> On Tue, Oct 23, 2018 at 6:47 AM Thomas Weise  wrote:
>>
>>> Thanks for taking a look, Yifan. Yes, it appears this was an intermittent
>>> issue.
>>>
>>> For beam_PostCommit_Python_VR_Flink we are left with:
>>>
>>> * beam15 docker errors
>>> * segmentation faults
>>> * "Execution failed for task ':beam-sdks-python:createProcessWorker'" -
>>> which should not even execute since we are using Docker
>>>
>>>
>>> [earlier quoted messages snipped]

Re: Docker missing on Beam15

2018-10-23 Thread Yifan Zou
FYI, the docker was restarted on beam15.

On Tue, Oct 23, 2018 at 7:08 AM Thomas Weise  wrote:

> For the latter (createProcessWorker):
> https://github.com/apache/beam/pull/6793
>
>
> On Tue, Oct 23, 2018 at 6:47 AM Thomas Weise  wrote:
>
>> Thanks for taking a look, Yifan. Yes, it appears this was an intermittent
>> issue.
>>
>> For beam_PostCommit_Python_VR_Flink we are left with:
>>
>> * beam15 docker errors
>> * segmentation faults
>> * "Execution failed for task ':beam-sdks-python:createProcessWorker'" -
>> which should not even execute since we are using Docker
>>
>>
>> [earlier quoted messages snipped]

Re: [DISCUSS] Publish vendored dependencies independently

2018-10-23 Thread Kenneth Knowles
I actually created the subtasks by finding things shaded by at least one
module. I think each one should definitely have an on-list discussion that
clarifies the target artifact, namespace, version, possible complications,
etc.

My impression is that many, many modules shade only Guava. So for build time
and simplification that is a big win.

Kenn

On Tue, Oct 23, 2018, 08:16 Thomas Weise  wrote:

> +1 for separate artifacts
>
> I would request that we explicitly discuss and agree which dependencies we
> vendor though.
>
> Not everything listed in the JIRA subtasks is currently relocated.
>
> Thomas
>
>
> [earlier quoted messages snipped]

Re: [DISCUSS] Publish vendored dependencies independently

2018-10-23 Thread Thomas Weise
+1 for separate artifacts

I would request that we explicitly discuss and agree which dependencies we
vendor though.

Not everything listed in the JIRA subtasks is currently relocated.

Thomas


On Tue, Oct 23, 2018 at 8:04 AM David Morávek 
wrote:

> +1 This should improve build times a lot. It would be great if vendored
> deps could stay in the main repository.
>
> D.
>
> [earlier quoted messages snipped]

Re: [DISCUSS] Publish vendored dependencies independently

2018-10-23 Thread David Morávek
+1 This should improve build times a lot. It would be great if vendored
deps could stay in the main repository.

D.

On Tue, Oct 23, 2018 at 12:21 PM Maximilian Michels  wrote:

> Looks great, Kenn!
>
> > Max: what is the story behind having a separate flink-shaded repo? Did
> that make it easier to manage in some way?
>
> Better separation of concerns, but I don't think releasing the shaded
> artifacts from the main repo is a problem. I'd even prefer not to split
> up the repo because it makes updates to the vendored dependencies
> slightly easier.
>
> [earlier quoted messages snipped]

Re: Docker missing on Beam15

2018-10-23 Thread Thomas Weise
For the latter (createProcessWorker):
https://github.com/apache/beam/pull/6793


On Tue, Oct 23, 2018 at 6:47 AM Thomas Weise  wrote:

> Thanks for taking a look, Yifan. Yes, it appears this was an intermittent
> issue.
>
> For beam_PostCommit_Python_VR_Flink we are left with:
>
> * beam15 docker errors
> * segmentation faults
> * "Execution failed for task ':beam-sdks-python:createProcessWorker'" -
> which should not even execute since we are using Docker
>
>
> [earlier quoted messages snipped]

Re: Docker missing on Beam15

2018-10-23 Thread Thomas Weise
Thanks for taking a look, Yifan. Yes, it appears this was an intermittent
issue.

For beam_PostCommit_Python_VR_Flink we are left with:

* beam15 docker errors
* segmentation faults
* "Execution failed for task ':beam-sdks-python:createProcessWorker'" -
which should not even execute since we are using Docker


On Mon, Oct 22, 2018 at 10:50 PM Yifan Zou  wrote:

> I'm not able to reproduce that error in Beam6 (#459, #460); it was
> probably due to some outage of Debian [1]. The image was successfully
> built, but the test failed for other reasons.
> And indeed, the beam_PostCommit_Python_VR_Flink is very flaky.
>
> Yifan
>
> [1] https://github.com/docker-library/python/issues/241
>
> On Mon, Oct 22, 2018 at 5:39 PM Thomas Weise  wrote:
>
>> Looks like we have more container-build-related errors.
>>
>> This is from beam6 -
>> https://builds.apache.org/job/beam_PostCommit_Python_VR_Flink_PR/44/
>>
>> Reading package lists...
>> W: The repository 'http://deb.debian.org/debian stretch Release'
>> does not have a Release file.
>>
>> W: The repository 'http://deb.debian.org/debian stretch-updates Release' 
>> does not have a Release file.
>> E: Failed to fetch 
>> http://deb.debian.org/debian/dists/stretch/main/binary-amd64/Packages  404  
>> Not Found
>> E: Failed to fetch 
>> http://deb.debian.org/debian/dists/stretch-updates/main/binary-amd64/Packages
>>   404  Not Found
>> E: Some index files failed to download. They have been ignored, or old ones 
>> used instead.
>>
>>
>> On Mon, Oct 22, 2018 at 2:54 PM Ankur Goenka  wrote:
>>
>>> Thanks Yifan!
>>>
>>> On Mon, Oct 22, 2018 at 2:53 PM Yifan Zou  wrote:
>>>
 So, looks like none of us have the permissions. I filed INFRA-17167
 to the Infra team to restart Docker on beam15.

 Thanks.
 Yifan

 On Mon, Oct 22, 2018 at 9:20 AM Scott Wegner  wrote:

> I've seen the docker issue pop up on website pre-commits as well:
> https://issues.apache.org/jira/browse/BEAM-5783. Those were also on
> beam15.
>
> When I searched around the internet I found lots of instances of the
> same error; it seems to be some unreliability in the guts of Docker [1].
> Perhaps restarting the VM or docker daemon could help. Does anybody have
> permissions to log on and try it?
>
> [1] https://github.com/moby/moby/issues/31849#issuecomment-320236354
>
> On Sun, Oct 21, 2018 at 7:13 PM Thomas Weise  wrote:
>
>> There are two issues with
>> https://builds.apache.org/job/beam_PostCommit_Python_VR_Flink/
>> currently:
>>
>> 1) The mentioned issue with docker on beam15 - Jason, can you
>> possibly advise how to deal with it?
>>
>> 2) Frequent failure due to "Segmentation fault (core dumped)", as
>> exhibited by
>> https://builds.apache.org/job/beam_PostCommit_Python_VR_Flink/449/consoleText
>>
>> The Gradle scan is here:
>>
>>
>> https://scans.gradle.com/s/ebhxs4l65cow4/failure?openFailures=WzBd=WzEse31d#top=0
>>
>> There are multiple of those in sequence on beam13
>>
>> Some more comments: https://issues.apache.org/jira/browse/BEAM-5467
>>
>> Any help to further investigate or fix would be appreciated!
>>
>> Thanks,
>> Thomas
>>
>>
>>
>> On Fri, Oct 19, 2018 at 4:51 PM Yifan Zou 
>> wrote:
>>
>>> I got "Failed to restart docker.service: Interactive authentication
>>> required" while trying to restart the docker on beam15.
>>> Does anyone have the permission to do that? Or, we need to ask
>>> Apache Infra for help.
>>>
>>> Thanks.
>>> Yifan
>>>
>>> On Fri, Oct 19, 2018 at 2:51 PM Ankur Goenka 
>>> wrote:
>>>
 Hi,

 Can we restart docker as it seems to have fixed the issue for
 others https://github.com/moby/moby/issues/31849 ?

 Thanks,
 Ankur

 On Fri, Oct 19, 2018 at 1:11 PM Yifan Zou 
 wrote:

> Hi,
>
> Docker has been installed on all Jenkins VMs. The image build
> process was interrupted by a gRPC connection issue.
>
> 11:02:12 Starting process 'command 'docker''. Working directory:
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_VR_Flink/src/sdks/python/container/build/docker
> Command: docker build --no-cache -t jenkins-docker-apache.bintray.io/beam/python:latest .
> 11:02:12 Successfully started process 'command 'docker''
> 11:02:12 Sending build context to Docker daemon  17.65MB
> 11:02:12 Step 1/9 : FROM python:2-stretch
> 11:02:12  ---> 3c43a5d4034a
> 11:02:12 Step 2/9 : MAINTAINER "Apache Beam "
> 11:02:12  ---> 

Re: [DISCUSS] Publish vendored dependencies independently

2018-10-23 Thread Maximilian Michels

Looks great, Kenn!


Max: what is the story behind having a separate flink-shaded repo? Did that 
make it easier to manage in some way?


Better separation of concerns, but I don't think releasing the shaded 
artifacts from the main repo is a problem. I'd even prefer not to split 
up the repo because it makes updates to the vendored dependencies 
slightly easier.


On 23.10.18 03:25, Kenneth Knowles wrote:
OK, I've filed https://issues.apache.org/jira/browse/BEAM-5819 to 
collect sub-tasks. This has enough upsides throughout lots of areas of 
the project that even though it is not glamorous it seems pretty 
valuable to start on immediately. And I want to find out if there's a 
pitfall lurking.


Max: what is the story behind having a separate flink-shaded repo? Did 
that make it easier to manage in some way?


Kenn

On Mon, Oct 22, 2018 at 2:55 AM Maximilian Michels wrote:


+1 for publishing vendored Jars independently. It will improve build
time and ease IntelliJ integration.

Flink also publishes shaded dependencies separately:

- https://github.com/apache/flink-shaded
- https://issues.apache.org/jira/browse/FLINK-6529

AFAIK their main motivation was to get rid of duplicate shaded classes
on the classpath. We don't appear to have that problem because we
already have a separate "vendor" project.

 >  - With shading, it is hard (impossible?) to step into dependency
code in IntelliJ's debugger, because the actual symbol at runtime
does not match what is in the external jars

This would be solved by releasing the sources of the shaded jars. From a
legal perspective, this could be problematic as alluded to here:
https://github.com/apache/flink-shaded/issues/25

-Max

On 20.10.18 01:11, Lukasz Cwik wrote:
 > I have tried several times to improve the build system and intellij
 > integration and each attempt ended with little progress when dealing
 > with vendored code. My latest attempt has been the most promising, where
 > I take the vendored classes/jars and decompile them, generating the
 > source that Intellij can then use. I have a branch[1] that demonstrates
 > the idea. It works pretty well (and up until a change where we started
 > vendoring gRPC, was impractical to do). Instructions to try it out are:
 >
 > // Clean up any remnants of prior builds/intellij projects
 > git clean -fdx
 > // Generate the source for vendored/shaded modules
 > ./gradlew decompile
 >
 > // Remove the "generated" Java sources for protos so they don't
 > // conflict with the decompiled sources.
 > rm -rf model/pipeline/build/generated/source/proto
 > rm -rf model/job-management/build/generated/source/proto
 > rm -rf model/fn-execution/build/generated/source/proto
 > // Import the project into Intellij; most code completion now works,
 > // though there are still some issues with a few classes.
 > // Note that the Java decompiler doesn't generate valid source, so we
 > // still need to delegate to Gradle for build/run/test actions.
 > // Other decompilers may do a better/worse job, but I haven't tried them.
 >
 >
 > The problems that I face are that the generated Java source from the
 > protos and the decompiled source from the compiled version of that
 > source post shading are both being imported as content roots and then
 > conflict. Also, the CFR decompiler isn't producing valid source; if
 > people could try others and report their mileage, we may find one that
 > works, and then we would be able to use intellij to build/run our code
 > and not need to delegate all our build/run/test actions to Gradle.
 >
 > After all these attempts, vendoring the dependencies outside of the
 > project seems like a sane approach, and unless someone wants to take a
 > stab at the best progress I have made above, I would go with what Kenn
 > is suggesting, even though it will mean that we will need to perform
 > releases every time we want to change the version of one of our
 > vendored dependencies.
 >
 > 1: https://github.com/lukecwik/incubator-beam/tree/intellij
 >
 >
 > On Fri, Oct 19, 2018 at 10:43 AM Kenneth Knowles <k...@apache.org> wrote:
 >
 >     Another reason to push on this is to get build times down. Once only
 >     generated proto classes use the shadow plugin we'll cut the build
 >     time in ~half? And there is no reason to constantly re-vendor.
 >
 >     Kenn
 >
 >     On Fri, Oct 19, 2018 at 10:39 AM Kenneth Knowles <k...@google.com> wrote:
 >
 >         Hi all,
 >
 >         A while ago we had pretty good consensus that we should vendor
 >     

Re: Python docs build error

2018-10-23 Thread Maximilian Michels

It looks like the build is now broken on Jenkins but runs fine on macOS.

There is some inconsistency in how `:pylint27` runs across the two 
platforms.


Broken build: 
https://builds.apache.org/job/beam_Release_Gradle_NightlySnapshot/216/


On 22.10.18 19:01, Ruoyun Huang wrote:

To Colm's question.

We observed this issue as well and had discussions in a separate thread
with Scott and Micah.


This issue was only reproduced in certain Linux environments. macOS does
not have this error. We also specifically ran the test on Jenkins, but
could not reproduce it there either.


On Mon, Oct 22, 2018 at 7:49 AM Colm O hEigeartaigh wrote:


Great, thanks! Out of curiosity, did the Jenkins job for the initial
PR not detect the build failure?

Colm.

On Mon, Oct 22, 2018 at 2:29 PM Maximilian Michels <m...@apache.org> wrote:

Correction for the footnote:

[1] https://github.com/apache/beam/pull/6637

On 22.10.18 15:24, Maximilian Michels wrote:
 > Hi Colm,
 >
 > This [1] got merged recently and broke the "docs" target which
 > apparently is not part of our Python PreCommit tests.
 >
 > See the following PR for a fix:
https://github.com/apache/beam/pull/6774
 >
 > Best,
 > Max
 >
 > [1] https://github.com/apache/beam/pull/6737
 >
 > On 22.10.18 12:55, Colm O hEigeartaigh wrote:
 >> Hi all,
 >>
 >> The following command: ./gradlew :beam-sdks-python:docs gives me the
 >> following error:
 >>
 >> /home/coheig/src/apache/beam/sdks/python/apache_beam/io/flink/flink_streaming_impulse_source.py:docstring
 >> of apache_beam.io.flink.flink_streaming_impulse_source.FlinkStreamingImpulseSource.from_runner_api_parameter:11:
 >> WARNING: Unexpected indentation.
 >> Command exited with non-zero status 1
 >> 42.81user 4.02system 0:16.27elapsed 287%CPU (0avgtext+0avgdata
 >> 141036maxresident)k
 >> 0inputs+47792outputs (0major+727274minor)pagefaults 0swaps
 >> ERROR: InvocationError for command '/usr/bin/time
 >> /home/coheig/src/apache/beam/sdks/python/scripts/generate_pydoc.sh'
 >> (exited with code 1)
 >> ___ summary
 >> 
 >> ERROR:   docs: commands failed
 >>
 >>  > Task :beam-sdks-python:docs FAILED
 >>
 >> FAILURE: Build failed with an exception.
 >>
 >> Am I missing something or is there an issue here?
 >>
 >> Thanks,
 >>
 >> Colm.
 >>
 >>
 >> --
 >> Colm O hEigeartaigh
 >>
 >> Talend Community Coder
 >> http://coders.talend.com



-- 
Colm O hEigeartaigh


Talend Community Coder
http://coders.talend.com



--

Ruoyun Huang



Re: [PROPOSAL] Move sorting to sdks-java-core

2018-10-23 Thread Robert Bradshaw
I like the idea of asking for a coder for T with properties X. (E.g. the
order-preserving one may not be the most efficient, so a poor default,
but required in some cases.)
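
As a minimal sketch of the recursive check such a property implies for
composite coders (all names here are hypothetical, not Beam API):

import java.util.List;

// Hypothetical: a coder that can report whether it preserves order. A
// KvCoder/ListCoder-style composite preserves order only if it does so
// itself and all of its component coders do too.
interface SketchCoder<T> {
  List<SketchCoder<?>> getComponents();
  boolean preservesOrderItself();
  default boolean preservesOrder() {
    return preservesOrderItself()
        && getComponents().stream().allMatch(SketchCoder::preservesOrder);
  }
}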

Note that if we go the route of secondary-key extraction, we don't even
need a full coder here, just an order-preserving encoding. (This has, as
mentioned, the disadvantage of possibly shuffling redundant data between the
order-providing key and the actual value.)
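
For illustration, one common order-preserving encoding (a sketch, not Beam
code): for a signed long, flipping the sign bit before writing big-endian
bytes makes unsigned lexicographic byte comparison agree with numeric order.

import java.nio.ByteBuffer;

// XOR-ing with Long.MIN_VALUE flips the sign bit, so the big-endian bytes
// of the result compare lexicographically in the same order as the
// original signed values.
final class OrderPreservingLongEncoding {
  static byte[] encode(long value) {
    return ByteBuffer.allocate(8).putLong(value ^ Long.MIN_VALUE).array();
  }
}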

On Mon, Oct 22, 2018 at 9:46 PM Kenneth Knowles  wrote:

> A related approach to Robert's that does not involve new types is to alter
> coder inference from the current:
>
> 1. Ask for a coder for type T
> 2. Check that the coder is (order preserving / deterministic)
>
> To:
>
> 1. Ask for an order preserving coder for T / ask for a deterministic coder
> for T
>
> This would allow recursive search for a list or KV coder that is order
> preserving. This could be implemented as a parallel code path in
> CoderRegistry without other changes, and invoked by transforms, even before
> any global changes to how coders are inferred. We'd have to be careful
> about pipeline upgrade compatibility.
>
> Kenn
>
> On Mon, Oct 22, 2018 at 12:40 PM David Morávek 
> wrote:
>
>> Lukasz, you are right. I didn't think about structured coders. Thanks
>>
>> On Mon, Oct 22, 2018 at 7:40 PM Lukasz Cwik  wrote:
>>
>>> I don't believe an interface will work because KvCoder/ListCoder/...
>>> would only be order preserving if their component coders were order
>>> preserving.
>>>
>>> On Mon, Oct 22, 2018 at 8:52 AM David Morávek 
>>> wrote:
>>>
 What should be the next step? I guess we all agree that the hadoop
 dependency should be split out. Then we're left with the SortValues
 transform + the in-memory implementation. I'm OK with keeping this as a
 separate module, as this would discourage users from using sorting in their
 business logic.

 Robert:
 Regarding the introduction of a new method on the coders: how about creating
 a new interface, e.g. *OrderPreservingCoder*? Then you can require this
 interface in your method signature and the IDE will autocomplete all of the
 possible implementations that you can use. With a new method, the user needs
 to know which implementations are order preserving, which can be really
 confusing. I think the same thinking should apply to other coder
 properties.

 D.



 On Thu, Oct 18, 2018 at 12:15 PM Niel Markwick 
 wrote:

> FYI: the BufferedExternalSorter depends on Hadoop client libraries
> (specifically hadoop_mapreduce_client_core and hadoop_common), but not on
> the Hadoop service -- because the ExternalSorter uses Hadoop's
> SequenceFile for on-disk sorting.
>
>
>
> On Thu, 18 Oct 2018 at 11:19 David Morávek 
> wrote:
>
>> Kenn, I believe we should not introduce a hadoop dependency to either the
>> sdks or the runners. We may split sorting into two packages: one with the
>> transformation + in-memory implementation (this is the part I'd love to see
>> become part of sdks-java-core) and a second module with the more robust
>> external sorter (with the hadoop dep).
>>
>> Does this make sense?
>>
>>
>> On Thu, Oct 18, 2018 at 2:03 AM Dan Halperin 
>> wrote:
>>
>>> On Wed, Oct 17, 2018 at 3:44 PM Kenneth Knowles 
>>> wrote:
>>>
 The runner can always just depend on the sorter to do it the legacy
 way by class matching; it shouldn't incur other dependency penalties...
 but now that I look briefly, the sorter depends on Hadoop bits. That
 seems a heavy price to pay for a user in any event. Are those Hadoop
 deps reasonably self-contained?

>>>
>>> Nice catch, Kenn! This is indeed why we didn't originally include
>>> the Sorter in core. The Hadoop deps have an enormous surface, or did at
>>> the time.
>>>
>>> Dan
>>>
>>>

 Kenn

 On Wed, Oct 17, 2018 at 2:39 PM Lukasz Cwik 
 wrote:

> Merging the sorter into sdks-java-core isn't needed for pipelines
> executed via portability since the Runner will be able to perform
> PTransform replacement and optimization based upon the URN of the 
> transform
> and its payload so it would never need to have the "Sorter" class in 
> its
> classpath.
>
> I'm ambivalent about whether merging it now is worth it.
>
> On Wed, Oct 17, 2018 at 2:31 PM David Morávek <
> david.mora...@gmail.com> wrote:
>
>> We can always fall back to the External 

Re: Build failed in Jenkins: beam_Release_Gradle_NightlySnapshot #216

2018-10-23 Thread Maximilian Michels

I don't get the error locally when running:

  gradle :beam-sdks-python:lintPy27

Seems like there is a different configuration on Jenkins?

On 23.10.18 10:16, Apache Jenkins Server wrote:

[quoted build log snipped; the full message follows below]

Build failed in Jenkins: beam_Release_Gradle_NightlySnapshot #216

2018-10-23 Thread Apache Jenkins Server
See 


Changes:

[david.moravek] [BEAM-5297] Add propdeps-idea plugin.

[25622840+adude3141] remove usage of deprecated Task.leftShift(Closure) method

[25622840+adude3141] remove usage of deprecated TaskInputs.dir() with something 
that doesn't

[25622840+adude3141] remove usage of deprecated 
FileCollection.stopExecutionIfEmpty() method

[25622840+adude3141] slightly update spotless gradle plugin to get rid of 
deprectated call to

[robertwb] [BEAM-5792] Implement Create in terms of Impulse + Reshuffle.

[robertwb] [BEAM-5791] Improve Python SDK progress counters.

[github] [BEAM-5779] Increase pubsub IT pipeline duration

[amaliujia] [BEAM-5796] Test window_end of TUMBLE, HOP, SESSION

[github] [BEAM-5617] Fix Python 3 incompatibility in pickler.

[25622840+adude3141] remove javacc bad option warning 'grammar_encoding'

[gleb] [BEAM-5675] Fix RowCoder#verifyDeterministic

[mxm] [BEAM-5707] Fix ':beam-sdks-python:docs' target

[gleb] [BEAM-5675] Simplify RowCoder#verifyDeterministic

[scott] Export Grafana testing dashboards and improve README

[mwylde] [BEAM-5797] Ensure bundle factory is always closed on dispose()

--
[...truncated 32.66 MB...]
Task ':beam-website:buildDockerImage' is not up-to-date because:
  Task has not declared any outputs despite executing actions.
Starting process 'command 'docker''. Working directory: 

 Command: docker build -t beam-website .
Successfully started process 'command 'docker''
Sending build context to Docker daemon  26.11MB
Step 1/7 : FROM ruby:2.5
2.5: Pulling from library/ruby
Digest: sha256:1952c6e03a10bf878f078ba93af5ea92fe0338ba6ad546dfa0a4a7203213f6ac
Status: Downloaded newer image for ruby:2.5
 ---> 1f6aca1e0959
Step 2/7 : WORKDIR /ruby
 ---> Using cache
 ---> 1887c501933e
Step 3/7 : RUN gem install bundler
 ---> Using cache
 ---> f05e3e3557d0
Step 4/7 : ADD Gemfile Gemfile.lock /ruby/
 ---> Using cache
 ---> 492b55665e3d
Step 5/7 : RUN bundle install --deployment --path $GEM_HOME
 ---> Using cache
 ---> 111d53a3a581
Step 6/7 : ENV LC_ALL C.UTF-8
 ---> Using cache
 ---> 7cc4fd4065c4
Step 7/7 : CMD sleep 3600
 ---> Using cache
 ---> d632a0311fe7
Successfully built d632a0311fe7
Successfully tagged beam-website:latest
:beam-website:buildDockerImage (Thread[Task worker for ':' Thread 9,5,main]) 
completed. Took 0.991 secs.
:beam-website:createDockerContainer (Thread[Task worker for ':' Thread 
9,5,main]) started.

> Task :beam-website:createDockerContainer
Caching disabled for task ':beam-website:createDockerContainer': Caching has 
not been enabled for the task
Task ':beam-website:createDockerContainer' is not up-to-date because:
  Task has not declared any outputs despite executing actions.
Starting process 'command '/bin/bash''. Working directory: 

 Command: /bin/bash -c docker create -v 
:/repo
 -u $(id -u):$(id -g)  beam-website
Successfully started process 'command '/bin/bash''
:beam-website:createDockerContainer (Thread[Task worker for ':' Thread 
9,5,main]) completed. Took 1.119 secs.
:beam-website:startDockerContainer (Thread[Task worker for ':' Thread 
9,5,main]) started.

> Task :beam-website:startDockerContainer
Caching disabled for task ':beam-website:startDockerContainer': Caching has not 
been enabled for the task
Task ':beam-website:startDockerContainer' is not up-to-date because:
  Task has not declared any outputs despite executing actions.
Starting process 'command 'docker''. Working directory: 

 Command: docker start 
ba98d579d35e3e7850bbe5584354dd1ef45fee53d084fa367f0e43f14069fc6f
Successfully started process 'command 'docker''
ba98d579d35e3e7850bbe5584354dd1ef45fee53d084fa367f0e43f14069fc6f
:beam-website:startDockerContainer (Thread[Task worker for ':' Thread 
9,5,main]) completed. Took 0.304 secs.
:beam-website:buildLocalWebsite (Thread[Task worker for ':' Thread 9,5,main]) 
started.

> Task :beam-website:buildLocalWebsite
Build cache key for task ':beam-website:buildLocalWebsite' is 
948e2017be85825f2d2abfa21c0edcc3
Caching disabled for task ':beam-website:buildLocalWebsite': Caching has not 
been enabled for the task
Task ':beam-website:buildLocalWebsite' is not up-to-date because:
  No history is available.
Starting process 'command 'docker''. Working directory: 

 Command: docker exec 
ba98d579d35e3e7850bbe5584354dd1ef45fee53d084fa367f0e43f14069fc6f /bin/bash -c 
cd /repo/build/website && bundle exec jekyll build --destination 
generated-local-content --config