Re: workspace cleanups needed on jenkins master

2019-01-03 Thread Udi Meiri
Alan, Mikhail, feel free to merge https://github.com/apache/beam/pull/7410
when ready.

On Thu, Jan 3, 2019 at 5:39 PM Mikhail Gryzykhin  wrote:

> It is not required for https://s.apache.org/beam-community-metrics .
>
> I believe that's the main dashboard we have atm.
>
> @Alan Myrvold  Can you confirm?
>
> --Mikhail
>
> Have feedback ?
>
>
> On Thu, Jan 3, 2019 at 1:59 PM Udi Meiri  wrote:
>
>>
>>
>> On Thu, Dec 27, 2018 at 11:02 AM Ismaël Mejía  wrote:
>>
>>> Bringing this subject for awareness to dev@
>>> We are sadly part of this top list.
>>> Does somebody know what this data is, and whether we can clean it
>>> periodically?
>>> Can somebody with more sysadmin superpowers take a look and act on this?
>>>
>>> -- Forwarded message -
>>> From: Chris Lambertus 
>>> Date: Thu, Dec 27, 2018 at 1:36 AM
>>> Subject: workspace cleanups needed on jenkins master
>>> To: 
>>>
>>>
>>> All,
>>>
>>> The Jenkins master needs to be cleaned up. Could the following
>>> projects please reduce your usage significantly by 5 January. After 5
>>> Jan Infra will be purging more aggressively and updating job
>>> configurations as needed. As a rule of thumb, we’d like to see
>>> projects retain no more than 1 week or 7 builds worth of historical
>>> data at the absolute maximum. Larger projects should retain less to
>>> avoid using up a disproportionate amount of space on the master.
>>>
>>> Some workspaces without any identifiable associated Project will be
>>> removed.
>>>
>>>
>>>
>>> 3911 GB .
>>> 275 GB ./incubator-netbeans-linux
>>> 270 GB ./pulsar-website-build
>>> 249 GB ./pulsar-master
>>> 199 GB ./Packaging
>>> 127 GB ./HBase
>>> 121 GB ./PreCommit-ZOOKEEPER-github-pr-build
>>> 107 GB ./Any23-trunk
>>> 102 GB ./incubator-netbeans-release
>>> 79 GB ./incubator-netbeans-linux-experiment
>>> 77 GB ./beam_PostCommit_Java_PVR_Flink_Batch
>>>
>>
>> Wow, this job has huge logs (400MB+).
>> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_Java_PVR_Flink_Batch/330/console
>> A few weeks back I suggested removing the --info flag passed to Gradle.
>> I haven't done that yet, but it might help reduce the size of the logs
>> (which I assume are all stored on master?).
>>
>> Short-term, we could reduce retention back down to 14 days instead of the
>> current 30.
>> +Alan Myrvold  +Mikhail Gryzykhin
>>   we don't need 30 days' retention any longer for test
>> dashboards, right?
>>
>>
>>> 76 GB ./HBase-1.3-JDK8
>>> 70 GB ./Jackrabbit-Oak-Windows
>>> 70 GB ./stanbol-0.12
>>> 59 GB ./HBase-Find-Flaky-Tests
>>> 54 GB ./CouchDB
>>> 51 GB ./beam_PostCommit_Java_PVR_Flink_Streaming
>>> 48 GB ./incubator-netbeans-windows
>>> 47 GB ./FlexJS
>>> 42 GB ./HBase
>>> 41 GB ./pulsar-pull-request
>>> 37 GB ./ZooKeeper_branch35_jdk8
>>> 32 GB ./HBase-Flaky-Tests
>>> 32 GB ./Atlas-master-NoTests
>>> 31 GB ./Atlas-1.0-NoTests
>>> 31 GB ./beam_PreCommit_Java_Commit
>>> 30 GB ./Zookeeper_UT_Java18
>>> 29 GB ./Phoenix-4.x-HBase-1.4
>>> 28 GB ./HBase-2.0-hadoop3-tests
>>> 27 GB ./flink-github-ci
>>> 27 GB ./ZooKeeper_branch35_jdk7
>>> 27 GB ./oodt-trunk
>>> 25 GB ./opennlp
>>> 25 GB ./Trinidad
>>> 22 GB ./Phoenix-4.x-HBase-1.3
>>> 21 GB ./ZooKeeper_Flaky_StressTest
>>> 21 GB ./Atlas-master-AllTests
>>> 21 GB ./beam_PostCommit_Java_ValidatesRunner_Flink
>>> 20 GB ./HBase-1.3-JDK7
>>> 20 GB ./PreCommit-HBASE-Build
>>> 18 GB ./hadoop-trunk-win
>>> 18 GB ./HBase-1.2-JDK7
>>> 18 GB ./HBASE-14070.HLC
>>> 18 GB ./maven-box
>>> 17 GB ./Atlas-1.0-AllTests
>>> 17 GB ./Archiva-TLP-Gitbox
>>> 17 GB ./Apache
>>> 17 GB ./Phoenix-5.x-HBase-2.0
>>> 17 GB ./Phoenix-omid2
>>> 16 GB ./Lucene-Solr-BadApples-NightlyTests-7.x
>>> 15 GB ./HBase-2.0
>>> 14 GB ./flume-trunk
>>> 14 GB ./beam_PostCommit_Java_ValidatesRunner_Samza
>>> 14 GB ./HBase-Trunk_matrix
>>> 13 GB ./commons-csv
>>> 13 GB ./HBase-Flaky-Tests-old-just-master
>>> 13 GB ./oodt-coverage
>>> 12 GB ./incubator-rya-master-with-optionals
>>> 12 GB ./Syncope-master-deploy
>>> 11 GB ./PreCommit-PHOENIX-Build
>>> 11 GB ./Stratos-Master-Nightly-Build
>>> 11 GB ./Phoenix-master
>>> 11 GB ./Hadoop-trunk-JACC
>>> 10 GB ./ctakes-trunk-package
>>> 10 GB ./FlexJS
>>> 10 GB ./Atlas-1.0-IntegrationTests
>>> 9 GB ./incubator-rya-master
>>> 9 GB ./Atlas-master-IntegrationTests
>>> 9 GB ./beam_PostCommit_Java_ValidatesRunner_Spark
>>> 9 GB ./ZooKeeper_UT_Java7
>>> 9 GB ./Qpid-Broker-J-7.0.x-TestMatrix
>>> 9 GB ./oodt-dependency-update
>>> 9 GB ./Apache
>>> 8 GB ./Struts-examples-JDK8-master
>>> 8 GB ./Phoenix-4.x-HBase-1.2
>>> 8 GB ./flume-github-pull-request
>>> 8 GB ./HBase-HBASE-14614
>>> 8 GB ./tika-trunk-jdk1.7
>>> 8 GB ./HBase-1.2-JDK8
>>> 8 GB ./HBase-1.5
>>> 7 GB ./Atlas-master-UnitTests
>>> 7 GB ./tika-2.x-windows
>>> 7 GB ./incubator-rya-master-with-optionals-pull-requests
>>> 7 GB ./Hive-trunk
>>> 7 GB ./beam_PreCommit_Java_Cron
>>> 7 GB 

Schemas in the Go SDK

2019-01-03 Thread Robert Burke
At this point I feel like the schema discussion should be a separate thread
from having a Coder Registry in Go, which was the original topic, so I'm
forking it.

It does sound like adding Schemas to the Go SDK would be a much larger
extension than the registry.

I'm not convinced that going without a convenient registry would serve Go SDK
users (such as they exist).

The concern I have isn't so much about Ints or Doubles, but about user types
such as Protocol Buffers (and not just those). There will be some users who
prize efficiency first and readability second. The Go SDK presently uses
JSON encoding by default, which has many of the properties of schemas but
is severely limiting for power users.
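
(To make the "convenient registry" idea concrete, here is a minimal,
self-contained sketch of what a type-keyed coder registry could look like.
The RegisterCoder/LookupCoder helpers and the []byte-based encode/decode
signatures below are illustrative assumptions, not the Go SDK's actual API.)

package coderreg

import (
    "reflect"
)

// Coder pairs user-supplied encode and decode functions for a single type.
// The []byte-based signatures are an assumption for illustration; a real
// registry would need to agree with the SDK's coder interfaces.
type Coder struct {
    Encode func(interface{}) ([]byte, error)
    Decode func([]byte) (interface{}, error)
}

// registry maps a concrete Go type to its user-defined coder.
var registry = map[reflect.Type]Coder{}

// RegisterCoder is a hypothetical helper: it associates a type with a custom
// encode/decode pair, overriding a default such as JSON.
func RegisterCoder(t reflect.Type, c Coder) {
    registry[t] = c
}

// LookupCoder returns the registered coder for a type, if any.
func LookupCoder(t reflect.Type) (Coder, bool) {
    c, ok := registry[t]
    return c, ok
}

A pipeline author would then register, say, a proto type's Marshal/Unmarshal
pair once at init time, and the SDK would consult the registry before falling
back to the default JSON encoding.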


It sounds like the following are true:
1. Full use of Schemas in the Go SDK will require FnAPI support.
* Until the FnAPI supports it, and the semantics are implemented in the
ULR, the Go SDK probably shouldn't begin to implement against it.
* This parallels how Go's lack of SplittableDoFn keeps Beam Go pipelines
from scaling or from having Cross-Language IO, which is also a precursor to
BeamGo using Beam SQL.
2. The main collision between Schemas and Coders is in the event that a
given type has both defined for it: which is being used, and when?
* This seems to me to have more to do with whether the syntactic sugar is
enabled or not, and we know that at construction time, by the very use of
the sugar.
* If a user wants to materialize a file encoded with the Schema, one would
need to encode that in the DoFn doing the writing somehow (e.g. ForceSchema
or ForceCoder, whichever we want to make the default). This has pipeline
compatibility implications.

It's not presently possible for Go to annotate function parameters, but
something could be worked out, similarly to how SideInputs are configured
in the Go SDK. I'd be concerned about the efficiency of those operations
though, even with Generics or code generation.
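
(For reference, here is roughly how that option pattern looks today with side
inputs in the Go SDK, plus a commented-out hypothetical option for forcing a
schema on a particular ParDo. The ForceSchema option does not exist and is
shown only to illustrate the shape such a knob could take; the import path is
assumed to be the sdks/go/pkg/beam package.)

package example

import (
    "github.com/apache/beam/sdks/go/pkg/beam"
)

// buildStage shows SideInputs being configured as ParDo options today.
func buildStage(s beam.Scope, words, lengths beam.PCollection) beam.PCollection {
    out := beam.ParDo(s, formatFn, words, beam.SideInput{Input: lengths})

    // A schema or coder preference could plausibly follow the same shape:
    //   beam.ParDo(s, writeFn, out, beam.ForceSchema{})  // hypothetical, not a real option
    return out
}

// formatFn is an illustrative DoFn: it reads the side input through the
// iterator form func(*int) bool and emits the word once per side-input value.
func formatFn(word string, lengths func(*int) bool, emit func(string)) {
    var n int
    for lengths(&n) {
        emit(word)
    }
}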


On Thu, 3 Jan 2019 at 16:33 Reuven Lax  wrote:

> On Fri, Jan 4, 2019 at 1:19 AM Robert Burke  wrote:
>
>> Very interesting Reuven!
>>
>> That would be a huge readability improvement, but it would also be a
>> significant investment over my time budget to implement them on the Go side
>> correctly. I would certainly want to read your documentation before going
>> ahead.  Will the Portability FnAPI have dedicated Schema support? That
>> would certainly change things.
>>
>
> Yes, there's absolutely a plan to add schema definitions to the FnAPI.
> This is what will allow you to use SQL from BeamGo.
>
>>
>> It's not clear to me how one might achieve the inversion from SchemaCoder
>> being a special casing of CustomCoder to the other way around, since a
>> field has a type, and that type needs to be encoded. Short of always
>> encoding the primitive values in the way Beam prefers, it doesn't seem to
>> allow for customizing the encoding on output, or really say anything
>> outside of the (admittedly excellent) syntactic sugar demonstrated with the
>> Java API.
>>
>
> I'm not quite sure I understand. But schemas define a fixed set of
> primitive types, and also define the encodings for those primitive types.
> If a user wants custom encoding for a primitive type, they can create a
> byte-array field and wrap that field with a Coder (this is why I said that
> today's Coders are simply special cases); this should be very rare though,
> as users rarely should care how Beam encodes a long or a double.
>
>>
>> Offhand, Schemas seem to be an alternative to pipeline construction,
>> rather than coders for value serialization, allowing manual field
>> extraction code to be omitted. They do not appear to be a fundamental
>> approach to achieve it. For example, the grouping operation still needs to
>> encode the whole of the object as a value.
>>
>
> Schemas are properties of the data - essentially a Schema is the data type
> of a PCollection. In Java Schemas are also understood by ParDo, so you can
> write a ParDo like this:
>
> @ProcessElement
> public void process(@Field("user") String userId,  @Field("country")
> String countryCode) {
> }
>
> These extra functionalities are part of the graph, but they are enabled by
> schemas.
>
>>
>> As mentioned, I'm hoping to have a solution for existing coders by
>> January's end, so waiting for your documentation doesn't work on that
>> timeline.
>>
>
> I don't think we need to wait for all the documentation to be written.
>
>
>>
>> That said, they aren't incompatible ideas as demonstrated by the Java
>> implementation. The Go SDK remains in an experimental state. We can change
>> things should the need arise in the next few months. Further, whenever
>> Generics in Go crop up, the existing user surface and execution stack will
>> need to be
>> re-written to take advantage of them anyway. That provides an opportunity
>> to invert Coder vs Schema dependence while getting a nice 

Re: workspace cleanups needed on jenkins master

2019-01-03 Thread Mikhail Gryzykhin
It is not required for https://s.apache.org/beam-community-metrics .

I believe that's the main dashboard we have atm.

@Alan Myrvold  Can you confirm?

--Mikhail

Have feedback ?


On Thu, Jan 3, 2019 at 1:59 PM Udi Meiri  wrote:

>
>
> On Thu, Dec 27, 2018 at 11:02 AM Ismaël Mejía  wrote:
>
>> Bringing this subject for awareness to dev@
>> We are sadly part of this top list.
>> Does somebody know what this data is, and whether we can clean it periodically?
>> Can somebody with more sysadmin superpowers take a look and act on this?
>>
>> -- Forwarded message -
>> From: Chris Lambertus 
>> Date: Thu, Dec 27, 2018 at 1:36 AM
>> Subject: workspace cleanups needed on jenkins master
>> To: 
>>
>>
>> All,
>>
>> The Jenkins master needs to be cleaned up. Could the following
>> projects please reduce your usage significantly by 5 January. After 5
>> Jan Infra will be purging more aggressively and updating job
>> configurations as needed. As a rule of thumb, we’d like to see
>> projects retain no more than 1 week or 7 builds worth of historical
>> data at the absolute maximum. Larger projects should retain less to
>> avoid using up a disproportionate amount of space on the master.
>>
>> Some workspaces without any identifiable associated Project will be
>> removed.
>>
>>
>>
>> 3911 GB .
>> 275 GB ./incubator-netbeans-linux
>> 270 GB ./pulsar-website-build
>> 249 GB ./pulsar-master
>> 199 GB ./Packaging
>> 127 GB ./HBase
>> 121 GB ./PreCommit-ZOOKEEPER-github-pr-build
>> 107 GB ./Any23-trunk
>> 102 GB ./incubator-netbeans-release
>> 79 GB ./incubator-netbeans-linux-experiment
>> 77 GB ./beam_PostCommit_Java_PVR_Flink_Batch
>>
>
> Wow, this job has huge logs (400MB+).
> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_Java_PVR_Flink_Batch/330/console
> A few weeks back I suggested removing the --info flag passed to Gradle.
> I haven't done that yet, but it might help reduce the size of the logs (which
> I assume are all stored on master?).
>
> Short-term, we could reduce retention back down to 14 days instead of the
> current 30.
> +Alan Myrvold  +Mikhail Gryzykhin   we
> don't need 30 days' retention any longer for test dashboards, right?
>
>
>> 76 GB ./HBase-1.3-JDK8
>> 70 GB ./Jackrabbit-Oak-Windows
>> 70 GB ./stanbol-0.12
>> 59 GB ./HBase-Find-Flaky-Tests
>> 54 GB ./CouchDB
>> 51 GB ./beam_PostCommit_Java_PVR_Flink_Streaming
>> 48 GB ./incubator-netbeans-windows
>> 47 GB ./FlexJS
>> 42 GB ./HBase
>> 41 GB ./pulsar-pull-request
>> 37 GB ./ZooKeeper_branch35_jdk8
>> 32 GB ./HBase-Flaky-Tests
>> 32 GB ./Atlas-master-NoTests
>> 31 GB ./Atlas-1.0-NoTests
>> 31 GB ./beam_PreCommit_Java_Commit
>> 30 GB ./Zookeeper_UT_Java18
>> 29 GB ./Phoenix-4.x-HBase-1.4
>> 28 GB ./HBase-2.0-hadoop3-tests
>> 27 GB ./flink-github-ci
>> 27 GB ./ZooKeeper_branch35_jdk7
>> 27 GB ./oodt-trunk
>> 25 GB ./opennlp
>> 25 GB ./Trinidad
>> 22 GB ./Phoenix-4.x-HBase-1.3
>> 21 GB ./ZooKeeper_Flaky_StressTest
>> 21 GB ./Atlas-master-AllTests
>> 21 GB ./beam_PostCommit_Java_ValidatesRunner_Flink
>> 20 GB ./HBase-1.3-JDK7
>> 20 GB ./PreCommit-HBASE-Build
>> 18 GB ./hadoop-trunk-win
>> 18 GB ./HBase-1.2-JDK7
>> 18 GB ./HBASE-14070.HLC
>> 18 GB ./maven-box
>> 17 GB ./Atlas-1.0-AllTests
>> 17 GB ./Archiva-TLP-Gitbox
>> 17 GB ./Apache
>> 17 GB ./Phoenix-5.x-HBase-2.0
>> 17 GB ./Phoenix-omid2
>> 16 GB ./Lucene-Solr-BadApples-NightlyTests-7.x
>> 15 GB ./HBase-2.0
>> 14 GB ./flume-trunk
>> 14 GB ./beam_PostCommit_Java_ValidatesRunner_Samza
>> 14 GB ./HBase-Trunk_matrix
>> 13 GB ./commons-csv
>> 13 GB ./HBase-Flaky-Tests-old-just-master
>> 13 GB ./oodt-coverage
>> 12 GB ./incubator-rya-master-with-optionals
>> 12 GB ./Syncope-master-deploy
>> 11 GB ./PreCommit-PHOENIX-Build
>> 11 GB ./Stratos-Master-Nightly-Build
>> 11 GB ./Phoenix-master
>> 11 GB ./Hadoop-trunk-JACC
>> 10 GB ./ctakes-trunk-package
>> 10 GB ./FlexJS
>> 10 GB ./Atlas-1.0-IntegrationTests
>> 9 GB ./incubator-rya-master
>> 9 GB ./Atlas-master-IntegrationTests
>> 9 GB ./beam_PostCommit_Java_ValidatesRunner_Spark
>> 9 GB ./ZooKeeper_UT_Java7
>> 9 GB ./Qpid-Broker-J-7.0.x-TestMatrix
>> 9 GB ./oodt-dependency-update
>> 9 GB ./Apache
>> 8 GB ./Struts-examples-JDK8-master
>> 8 GB ./Phoenix-4.x-HBase-1.2
>> 8 GB ./flume-github-pull-request
>> 8 GB ./HBase-HBASE-14614
>> 8 GB ./tika-trunk-jdk1.7
>> 8 GB ./HBase-1.2-JDK8
>> 8 GB ./HBase-1.5
>> 7 GB ./Atlas-master-UnitTests
>> 7 GB ./tika-2.x-windows
>> 7 GB ./incubator-rya-master-with-optionals-pull-requests
>> 7 GB ./Hive-trunk
>> 7 GB ./beam_PreCommit_Java_Cron
>> 7 GB ./Atlas-1.0-UnitTests
>> 6 GB ./Jackrabbit
>> 6 GB ./beam_PostCommit_Java_PVR_Flink_PR
>> 6 GB ./Lucene-Solr-Clover-master
>> 6 GB ./Syncope-2_0_X-deploy
>> 6 GB ./beam_PostCommit_Java_ValidatesRunner_Apex
>> 6 GB ./Tika-trunk
>> 6 GB ./pirk
>> 6 GB ./Syncope-2_1_X-deploy
>> 6 GB ./PLC4X
>> 6 GB 

Re: Add code quality checks to pre-commits.

2019-01-03 Thread Heejong Lee
I think we should also consider the false-positive ratio of the tool.
Oftentimes, deeper analysis easily produces tons of false positives, which
makes people less interested in static analysis results because of the
triaging overhead.

On Thu, Jan 3, 2019 at 4:18 PM Scott Wegner  wrote:

> Discussion on software engineering practices and tools tends to gather
> many opinions :) I suggest breaking this out into a doc to keep the
> discussion organized.
>
> I appreciate that you've started with a list of requirements. I would add
> a few:
>
> 6. Analysis results should be integrated into the code review workflow.
> 7. It should also be possible to run analysis and evaluate results locally.
> 8. Analysis rules and thresholds should be easily configurable.
>
> And some thoughts on the previous requirements:
>
> > 2. Tool should keep history of reports.
>
> Seems nice-to-have but not required. I believe the most value is viewing
> the delta during code review, and also maybe a snapshot of the overall
> state of master. If we want trends we could also import data into
> s.apache.org/beam-community-metrics
>
> > 4. Tool should incorporate code coverage and static analysis reports.
> (Or more if applicable)
>
> Is the idea to have a single tool responsible for all code analysis? We
> currently have a variety of tools running in our build. It would be
> challenging to find a single tool that aggregates all current (and future)
> analysis, especially considering the different language ecosystems. Having
> targeted tools responsible for different pieces allows us to
> pick-and-choose what works best for Beam.
>
>
> On Thu, Jan 3, 2019 at 3:43 PM Mikhail Gryzykhin 
> wrote:
>
>> Let me summarize and answer main question that I see:
>> 1. Seems that we do want to have some statistics on coverage and
>> integrate automatic requirements into our build system.
>> 2. Implementation is still to be discussed.
>>
>> Let's talk about implementation further.
>>
>> My requirements for choice are:
>> 1. Tool should give us an option for deep-dive into findings.
>> 2. Tool should keep history of reports.
>> 3. Tool should give an option to break build (allow for hardcoded
>> requirements)
>> 4. Tool should incorporate code coverage and static analysis reports. (Or
>> more if applicable)
>> 5. Tool should support most or all languages we utilize in beam.
>>
>> Let me dive into SonarQube a bit first. (All based on my understanding of
>> how it works.)
>> It hits most of the points, potentially with some tweaks.
>> This tool relies on reports generated by common tools. It also tracks the
>> history of builds and allows navigating it. It is multi-language. I'm still
>> working on figuring out how to configure it though.
>>
>> Common thresholds/checks suggested by SonarQube:
>> Many checks can be applied to new code only. This allows us to avoid fixing
>> legacy code while keeping all new additions clean and neat (ish).
>> Test coverage by line/branch: relies on a Cobertura report. Usually
>> coverage by branch is suggested (all "if" lines should be tested with both
>> positive and negative condition results).
>> Method complexity: the number of different paths/conditions a method can
>> be invoked with. The suggested maximum is 15. It generally describes how
>> easy a method is to test/understand.
>> Bugs/vulnerabilities: generally the output of FindBugs. Reflects commonly
>> vulnerable/dangerous code that might cause errors, or just errors in code.
>> I believe that Sonar allows for custom code analysis as well, but that is
>> not required.
>> Technical debt: estimates of how much time it will take to clean up code
>> to make it shiny. Includes code duplication, commented-out code, not following
>> naming conventions, long methods, ifs that can be inverted, public methods
>> that can be private, etc. I'm not familiar with the explicit list, but in my
>> experience the suggestions are usually relevant.
>> More on metrics can be found here:
>> https://docs.sonarqube.org/latest/user-guide/metric-definitions/
>>
>> Suggested alternatives:
>> https://scan.coverity.com/
>> This tool looks great and I'll check more on it. But it has a restriction
>> to 14 or 7 builds per week (not sure how they will estimate our project).
>> Also, I'm not sure if we can break pre-commits based on reports from
>> Coverity. It looks good for generating historical data.
>>
>> https://docs.codecov.io/docs/browser-extension
>> I'll check more on this one. It looks great to have it integrated in PRs,
>> although it requires each developer to install a plugin. I don't think
>> it allows breaking builds, and it only does coverage. Am I correct?
>>
>> Regards,
>> --Mikhail
>>
>> Have feedback ?
>>
>> On Thu, Jan 3, 2019 at 2:18 PM Kenneth Knowles  wrote:
>>
>>> It would be very useful to have line and/or branch coverage visible.
>>> These are both very weak proxies for quality or reliability, so IMO strict
>>> thresholds are not helpful. One thing that is super useful is to integrate
>>> 

Re: [Go SDK] User Defined Coders

2019-01-03 Thread Reuven Lax
On Fri, Jan 4, 2019 at 1:19 AM Robert Burke  wrote:

> Very interesting Reuven!
>
> That would be a huge readability improvement, but it would also be a
> significant investment over my time budget to implement them on the Go side
> correctly. I would certainly want to read your documentation before going
> ahead.  Will the Portability FnAPI have dedicated Schema support? That
> would certainly change things.
>

Yes, there's absolutely a plan to add schema definitions to the FnAPI. This
is what will allow you to use SQL from BeamGo.

>
> It's not clear to me how one might achieve the inversion from SchemaCoder
> being a special casing of CustomCoder to the other way around, since a
> field has a type, and that type needs to be encoded. Short of always
> encoding the primitive values in the way Beam prefers, it doesn't seem to
> allow for customizing the encoding on output, or really say anything
> outside of the (admittedly excellent) syntactic sugar demonstrated with the
> Java API.
>

I'm not quite sure I understand. But schemas define a fixed set of
primitive types, and also define the encodings for those primitive types.
If a user wants custom encoding for a primitive type, they can create a
byte-array field and wrap that field with a Coder (this is why I said that
today's Coders are simply special cases); this should be very rare though,
as users rarely should care how Beam encodes a long or a double.

>
> Offhand, Schemas seem to be an alternative to pipeline construction,
> rather than coders for value serialization, allowing manual field
> extraction code to be omitted. They do not appear to be a fundamental
> approach to achieve it. For example, the grouping operation still needs to
> encode the whole of the object as a value.
>

Schemas are properties of the data - essentially a Schema is the data type
of a PCollection. In Java Schemas are also understood by ParDo, so you can
write a ParDo like this:

@ProcessElement
public void process(@Field("user") String userId,  @Field("country") String
countryCode) {
}

These extra functionalities are part of the graph, but they are enabled by
schemas.

>
> As mentioned, I'm hoping to have a solution for existing coders by
> January's end, so waiting for your documentation doesn't work on that
> timeline.
>

I don't think we need to wait for all the documentation to be written.


>
> That said, they aren't incompatible ideas as demonstrated by the Java
> implementation. The Go SDK remains in an experimental state. We can change
> things should the need arise in the next few months. Further, whenever
> Generics in Go crop up, the existing user surface and execution stack will
> need to be
> re-written to take advantage of them anyway. That provides an opportunity
> to invert Coder vs Schema dependence while getting a nice performance
> boost, and cleaner code (and deleting much of my code generator).
>
> 
>
> Were I to implement schemas to get the same syntactic benefits as the Java
> API, I'd be leveraging the field annotations Go has. This satisfies the
> protocol buffer issue as well, since generated go protos have name & json
> annotations. Schemas could be extracted that way. These are also available
> to anything using static analysis for more direct generation of accessors.
> The reflective approach would also work, which is excellent for development
> purposes.
>
> The rote code that the schemas were replacing would be able to be cobbled
> together into efficient DoFn and CombineFns for serialization. At present,
> it seems like it could be implemented as a side package that uses beam,
> rather than changing portions of the core beam Go packages, The real trick
> would be to do so without "apply" since that's not how the Go SDK is shaped.
>
>
>
>
> On Thu, 3 Jan 2019 at 15:34 Gleb Kanterov  wrote:
>
>> Reuven, it sounds great. I see there is a similar thing to Row coders
>> happening in Apache Arrow , and there is a
>> similarity between Apache Arrow Flight
>> 
>> and data exchange service in portability. How do you see these two things
>> relate to each other in the long term?
>>
>> On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax  wrote:
>>
>>> The biggest advantage is actually readability and usability. A secondary
>>> advantage is that it means that Go will be able to interact seamlessly with
>>> BeamSQL, which would be a big win for Go.
>>>
>>> A schema is basically a way of saying that a record has a specific set
>>> of (possibly nested, possibly repeated) fields. So for instance let's say
>>> that the user's type is a struct with fields named user, country,
>>> purchaseCost. This allows us to provide transforms that operate on field
>>> names. Some example (using the Java API):
>>>
>>> PCollection users = events.apply(Select.fields("user"));  // Select out

Re: [Go SDK] User Defined Coders

2019-01-03 Thread Robert Burke
Very interesting Reuven!

That would be a huge readability improvement, but it would also be a
significant investment over my time budget to implement them on the Go side
correctly. I would certainly want to read your documentation before going
ahead.  Will the Portability FnAPI have dedicated Schema support? That
would certainly change things.

It's not clear to me how one might achieve the inversion from SchemaCoder
being a special casing of CustomCoder to the other way around, since a
field has a type, and that type needs to be encoded. Short of always
encoding the primitive values in the way Beam prefers, it doesn't seem to
allow for customizing the encoding on output, or really say anything
outside of the (admittedly excellent) syntactic sugar demonstrated with the
Java API.

Offhand, Schemas seem to be an alternative to pipeline construction, rather
than coders for value serialization, allowing manual field extraction code
to be omitted. They do not appear to be a fundamental approach to achieve
it. For example, the grouping operation still needs to encode the whole of
the object as a value.

As mentioned, I'm hoping to have a solution for existing coders by
January's end, so waiting for your documentation doesn't work on that
timeline.

That said, they aren't incompatible ideas as demonstrated by the Java
implementation. The Go SDK remains in an experimental state. We can change
things should the need arise in the next few months. Further, whenever
Generics in Go crop up, the existing user surface and execution stack will
need to be
re-written to take advantage of them anyway. That provides an opportunity
to invert Coder vs Schema dependence while getting a nice performance
boost, and cleaner code (and deleting much of my code generator).



Were I to implement schemas to get the same syntactic benefits as the Java
API, I'd be leveraging the field annotations Go has. This satisfies the
protocol buffer issue as well, since generated go protos have name & json
annotations. Schemas could be extracted that way. These are also available
to anything using static analysis for more direct generation of accessors.
The reflective approach would also work, which is excellent for development
purposes.
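
(A rough sketch of that direction using only the standard library: the
Purchase struct, its tags, and the fieldNames helper are made up for the
example, but this is the kind of field metadata a schema could be derived
from; generated protos carry similar json tags.)

package main

import (
    "fmt"
    "reflect"
)

// Purchase is a hypothetical user type; the tags mirror what generated protos
// and the BigQuery / encoding/json packages already provide.
type Purchase struct {
    User         string  `json:"user"`
    Country      string  `json:"country"`
    PurchaseCost float64 `json:"purchaseCost"`
}

// fieldNames walks a struct type reflectively and returns schema-style field
// names, preferring the json tag over the Go field name.
func fieldNames(t reflect.Type) []string {
    var names []string
    for i := 0; i < t.NumField(); i++ {
        f := t.Field(i)
        if tag, ok := f.Tag.Lookup("json"); ok && tag != "" {
            names = append(names, tag)
            continue
        }
        names = append(names, f.Name)
    }
    return names
}

func main() {
    // Prints: [user country purchaseCost]
    fmt.Println(fieldNames(reflect.TypeOf(Purchase{})))
}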

The rote code that the schemas were replacing would be able to be cobbled
together into efficient DoFn and CombineFns for serialization. At present,
it seems like it could be implemented as a side package that uses beam,
rather than changing portions of the core beam Go packages. The real trick
would be to do so without "apply" since that's not how the Go SDK is shaped.




On Thu, 3 Jan 2019 at 15:34 Gleb Kanterov  wrote:

> Reuven, it sounds great. I see there is a similar thing to Row coders
> happening in Apache Arrow , and there is a
> similarity between Apache Arrow Flight
> 
> and data exchange service in portability. How do you see these two things
> relate to each other in the long term?
>
> On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax  wrote:
>
>> The biggest advantage is actually readability and usability. A secondary
>> advantage is that it means that Go will be able to interact seamlessly with
>> BeamSQL, which would be a big win for Go.
>>
>> A schema is basically a way of saying that a record has a specific set of
>> (possibly nested, possibly repeated) fields. So for instance let's say that
>> the user's type is a struct with fields named user, country, purchaseCost.
>> This allows us to provide transforms that operate on field names. Some
>> example (using the Java API):
>>
>> PCollection users = events.apply(Select.fields("user"));  // Select out
>> only the user field.
>>
>> PCollection joinedEvents =
>> queries.apply(Join.innerJoin(clicks).byFields("user"));  // Join two
>> PCollections by user.
>>
>> // For each country, calculate the total purchase cost as well as the top
>> 10 purchases.
>> // A new schema is created containing fields total_cost and
>> top_purchases, and rows are created with the aggregation results.
>> PCollection purchaseStatistics = events.apply(
>> Group.byFieldNames("country")
>>.aggregateField("purchaseCost", Sum.ofLongs(),
>> "total_cost"))
>> .aggregateField("purchaseCost", Top.largestLongs(10),
>> "top_purchases"))
>>
>>
>> This is far more readable than what we have today, and what unlocks this
>> is that Beam actually knows the structure of the record instead of assuming
>> records are uncrackable blobs.
>>
>> Note that a coder is basically a special case of a schema that has a
>> single field.
>>
>> In BeamJava we have a SchemaRegistry which knows how to turn user types
>> into schemas. We use reflection to analyze many user types (e.g. simple
>> POJO structs, JavaBean classes, Avro records, protocol buffers, etc.) to
>> determine the 

Re: Add code quality checks to pre-commits.

2019-01-03 Thread Scott Wegner
Discussion on software engineering practices and tools tends to gather many
opinions :) I suggest breaking this out into a doc to keep the discussion
organized.

I appreciate that you've started with a list of requirements. I would add a
few:

6. Analysis results should be integrated into the code review workflow.
7. It should also be possible to run analysis and evaluate results locally.
8. Analysis rules and thresholds should be easily configurable.

And some thoughts on the previous requirements:

> 2. Tool should keep history of reports.

Seems nice-to-have but not required. I believe the most value is viewing
the delta during code review, and also maybe a snapshot of the overall
state of master. If we want trends we could also import data into
s.apache.org/beam-community-metrics

> 4. Tool should encorporate code coverage and static analysis reports. (Or
more if applicable)

Is the idea to have a single tool responsible for all code analysis? We
currently have a variety of tools running in our build. It would be
challenging to find a single tool that aggregates all current (and future)
analysis, especially considering the different language ecosystems. Having
targeted tools responsible for different pieces allows us to
pick-and-choose what works best for Beam.


On Thu, Jan 3, 2019 at 3:43 PM Mikhail Gryzykhin  wrote:

> Let me summarize and answer main question that I see:
> 1. Seems that we do want to have some statistics on coverage and integrate
> automatic requirements into our build system.
> 2. Implementation is still to be discussed.
>
> Let's talk about implementation further.
>
> My requirements for choice are:
> 1. Tool should give us an option for deep-dive into findings.
> 2. Tool should keep history of reports.
> 3. Tool should give an option to break build (allow for hardcoded
> requirements)
> 4. Tool should incorporate code coverage and static analysis reports. (Or
> more if applicable)
> 5. Tool should support most or all languages we utilize in beam.
>
> Let me dive into SonarQube a bit first. (All based on my understanding of how
> it works.)
> It hits most of the points, potentially with some tweaks.
> This tool relies on reports generated by common tools. It also tracks the
> history of builds and allows navigating it. It is multi-language. I'm still
> working on figuring out how to configure it though.
>
> Common thresholds/checks suggested by SonarQube:
> Many checks can be applied to new code only. This allows us to avoid fixing
> legacy code while keeping all new additions clean and neat (ish).
> Test coverage by line/branch: relies on a Cobertura report. Usually coverage
> by branch is suggested (all "if" lines should be tested with both positive
> and negative condition results).
> Method complexity: the number of different paths/conditions a method can be
> invoked with. The suggested maximum is 15. It generally describes how easy it
> is to test/understand a method.
> Bugs/vulnerabilities: generally the output of FindBugs. Reflects commonly
> vulnerable/dangerous code that might cause errors, or just errors in code.
> I believe that Sonar allows for custom code analysis as well, but that is
> not required.
> Technical debt: estimates of how much time it will take to clean up code
> to make it shiny. Includes code duplication, commented-out code, not following
> naming conventions, long methods, ifs that can be inverted, public methods
> that can be private, etc. I'm not familiar with the explicit list, but in my
> experience the suggestions are usually relevant.
> More on metrics can be found here:
> https://docs.sonarqube.org/latest/user-guide/metric-definitions/
>
> Suggested alternatives:
> https://scan.coverity.com/
> This tool looks great and I'll check more on it. But it has a restriction
> to 14 or 7 builds per week (not sure how they will estimate our project).
> Also, I'm not sure if we can break pre-commits based on reports from
> Coverity. It looks good for generating historical data.
>
> https://docs.codecov.io/docs/browser-extension
> I'll check more on this one. It looks great to have it integrated in PRs,
> although it requires each developer to install a plugin. I don't think
> it allows breaking builds, and it only does coverage. Am I correct?
>
> Regards,
> --Mikhail
>
> Have feedback ?
>
> On Thu, Jan 3, 2019 at 2:18 PM Kenneth Knowles  wrote:
>
>> It would be very useful to have line and/or branch coverage visible.
>> These are both very weak proxies for quality or reliability, so IMO strict
>> thresholds are not helpful. One thing that is super useful is to integrate
>> line coverage into code review, like this:
>> https://docs.codecov.io/docs/browser-extension. It is very easy to
>> notice major missing tests.
>>
>> We have never really used Sonarqube. It was turned on as a possibility in
>> the early days but never worked on past that point. Could be nice. I
>> suspect there's a lot to be gained by just finding very low numbers and
>> improving them. So just running 

Re: [Go SDK] User Defined Coders

2019-01-03 Thread Reuven Lax
I looked at Apache Arrow as a potential serialization format for Row
coders. At the time it didn't seem a perfect fit - Beam's programming model
is record-at-a-time, and Arrow is optimized for large batches of records
(while Beam has a concept of "bundles", they are completely
nondeterministic, and records might bundle differently on retry). You could use
Arrow with single-record batches, but I suspect that would end up adding a
lot of extra overhead. That being said, I think it's still something worth
investigating further.

Reuven



On Fri, Jan 4, 2019 at 12:34 AM Gleb Kanterov  wrote:

> Reuven, it sounds great. I see there is a similar thing to Row coders
> happening in Apache Arrow , and there is a
> similarity between Apache Arrow Flight
> 
> and data exchange service in portability. How do you see these two things
> relate to each other in the long term?
>
> On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax  wrote:
>
>> The biggest advantage is actually readability and usability. A secondary
>> advantage is that it means that Go will be able to interact seamlessly with
>> BeamSQL, which would be a big win for Go.
>>
>> A schema is basically a way of saying that a record has a specific set of
>> (possibly nested, possibly repeated) fields. So for instance let's say that
>> the user's type is a struct with fields named user, country, purchaseCost.
>> This allows us to provide transforms that operate on field names. Some
>> example (using the Java API):
>>
>> PCollection users = events.apply(Select.fields("user"));  // Select out
>> only the user field.
>>
>> PCollection joinedEvents =
>> queries.apply(Join.innerJoin(clicks).byFields("user"));  // Join two
>> PCollections by user.
>>
>> // For each country, calculate the total purchase cost as well as the top
>> 10 purchases.
>> // A new schema is created containing fields total_cost and
>> top_purchases, and rows are created with the aggregation results.
>> PCollection purchaseStatistics = events.apply(
>> Group.byFieldNames("country")
>>.aggregateField("purchaseCost", Sum.ofLongs(),
>> "total_cost"))
>> .aggregateField("purchaseCost", Top.largestLongs(10),
>> "top_purchases"))
>>
>>
>> This is far more readable than what we have today, and what unlocks this
>> is that Beam actually knows the structure of the record instead of assuming
>> records are uncrackable blobs.
>>
>> Note that a coder is basically a special case of a schema that has a
>> single field.
>>
>> In BeamJava we have a SchemaRegistry which knows how to turn user types
>> into schemas. We use reflection to analyze many user types (e.g. simple
>> POJO structs, JavaBean classes, Avro records, protocol buffers, etc.) to
>> determine the schema, however this is done only when the graph is initially
>> generated. We do use code generation (in Java we do bytecode generation) to
>> make this somewhat more efficient. I'm willing to bet that the code
>> generator you've written for structs could be very easily modified for
>> schemas instead, so it would not be wasted work if we went with schemas.
>>
>> One of the things I'm working on now is documenting Beam schemas. They
>> are already very powerful and useful, but since there is still nothing in
>> our documentation about them, they are not yet widely used. I expect to
>> finish draft documentation by the end of January.
>>
>> Reuven
>>
>> On Thu, Jan 3, 2019 at 11:32 PM Robert Burke  wrote:
>>
>>> That's an interesting idea. I must confess I don't rightly know the
>>> difference between a schema and coder, but here's what I've got with a bit
>>> of searching through memory and the mailing list. Please let me know if I'm
>>> off track.
>>>
>>> As near as I can tell, a schema, as far as Beam takes it
>>> 
>>>  is
>>> a mechanism to define what data is extracted from a given row of data. So
>>> in principle, there's an opportunity to be more efficient with data with
>>> many columns that aren't being used, and only extract the data that's
>>> meaningful to the pipeline.
>>> The trick then is how to apply the schema to a given serialization
>>> format, which is something I'm missing in my mental model (and then how to
>>> do it efficiently in Go).
>>>
>>> I do know that the Go client package for BigQuery
>>>  does
>>> something like that, using field tags. Similarly, the "encoding/json"
>>>  package in the Go
>>> Standard Library permits annotating fields and it will read out and
>>> deserialize the JSON fields and that's it.
>>>
>>> A concern I have is that Go (at present) would require pre-compile time
>>> code generation for schemas to be efficient, and 

Re: Add code quality checks to pre-commits.

2019-01-03 Thread Mikhail Gryzykhin
Let me summarize and answer the main questions that I see:
1. It seems that we do want to have some statistics on coverage and integrate
automatic requirements into our build system.
2. The implementation is still to be discussed.

Let's talk about the implementation further.

My requirements for choice are:
1. Tool should give us an option for deep-dive into findings.
2. Tool should keep history of reports.
3. Tool should give an option to break build (allow for hardcoded
requirements)
4. Tool should incorporate code coverage and static analysis reports. (Or
more if applicable)
5. Tool should support most or all languages we utilize in beam.

Let me dive into SonarQube a bit first. (All based on my understanding of how
it works.)
It hits most of the points, potentially with some tweaks.
This tool relies on reports generated by common tools. It also tracks the
history of builds and allows navigating it. It is multi-language. I'm still
working on figuring out how to configure it though.

Common thresholds/checks suggested by SonarQube:
Many checks can be applied to new code only. This allows us to avoid fixing
legacy code while keeping all new additions clean and neat (ish).
Test coverage by line/branch: relies on a Cobertura report. Usually coverage
by branch is suggested (all "if" lines should be tested with both positive
and negative condition results).
Method complexity: the number of different paths/conditions a method can be
invoked with. The suggested maximum is 15. It generally describes how easy a
method is to test/understand.
Bugs/vulnerabilities: generally the output of FindBugs. Reflects commonly
vulnerable/dangerous code that might cause errors, or just errors in code.
I believe that Sonar allows for custom code analysis as well, but that is
not required.
Technical debt: estimates of how much time it will take to clean up code
to make it shiny. Includes code duplication, commented-out code, not following
naming conventions, long methods, ifs that can be inverted, public methods
that can be private, etc. I'm not familiar with the explicit list, but in my
experience the suggestions are usually relevant.
More on metrics can be found here:
https://docs.sonarqube.org/latest/user-guide/metric-definitions/

Suggested alternatives:
https://scan.coverity.com/
This tool looks great and I'll check more on it. But it has a restriction
to 14 or 7 builds per week (not sure how they will estimate our project).
Also, I'm not sure if we can break pre-commits based on reports from
Coverity. It looks good for generating historical data.

https://docs.codecov.io/docs/browser-extension
I'll check more on this one. It looks great to have it integrated in PRs,
although it requires each developer to install a plugin. I don't think
it allows breaking builds, and it only does coverage. Am I correct?

Regards,
--Mikhail

Have feedback ?

On Thu, Jan 3, 2019 at 2:18 PM Kenneth Knowles  wrote:

> It would be very useful to have line and/or branch coverage visible. These
> are both very weak proxies for quality or reliability, so IMO strict
> thresholds are not helpful. One thing that is super useful is to integrate
> line coverage into code review, like this:
> https://docs.codecov.io/docs/browser-extension. It is very easy to notice
> major missing tests.
>
> We have never really used Sonarqube. It was turned on as a possibility in
> the early days but never worked on past that point. Could be nice. I
> suspect there's a lot to be gained by just finding very low numbers and
> improving them. So just running Jacoco's offline HTML generation would do
> it (also this integrates with Jenkins). I tried this the other day and
> discovered that our gradle config is broken and does not wire tests and
> coverage reporting together properly. Last thing: How is "technical debt"
> measured? I'm skeptical of quantitative measures for qualitative notions.
>
> Kenn
>
> On Thu, Jan 3, 2019 at 1:58 PM Heejong Lee  wrote:
>
>> I don't have any experience of using SonarQube but Coverity worked well
>> for me. Looks like it already has beam repo:
>> https://scan.coverity.com/projects/11881
>>
>> On Thu, Jan 3, 2019 at 1:27 PM Reuven Lax  wrote:
>>
>>> checkstyle and findbugs are already run as precommit checks, are they
>>> not?
>>>
>>> On Thu, Jan 3, 2019 at 7:19 PM Mikhail Gryzykhin 
>>> wrote:
>>>
 Hi everyone,

 In our current builds we (can) run multiple code quality check tools
 like Checkstyle, FindBugs, and code test coverage via Cobertura. However, we do
 not utilize many of those signals.

 I suggest to add requirements to code based on those tools.
 Specifically, I suggest to add pre-commit checks that will require PRs to
 conform to some quality checks.

 We can see good example of thresholds to add at Apache SonarQube
 provided default quality gate config
 :
 80% tests coverage on new code,
 5% technical debt on new code,
 No 

Re: [Go SDK] User Defined Coders

2019-01-03 Thread Gleb Kanterov
Reuven, it sounds great. I see there is a similar thing to Row coders
happening in Apache Arrow , and there is a
similarity between Apache Arrow Flight

and the data exchange service in portability. How do you see these two things
relating to each other in the long term?

On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax  wrote:

> The biggest advantage is actually readability and usability. A secondary
> advantage is that it means that Go will be able to interact seamlessly with
> BeamSQL, which would be a big win for Go.
>
> A schema is basically a way of saying that a record has a specific set of
> (possibly nested, possibly repeated) fields. So for instance let's say that
> the user's type is a struct with fields named user, country, purchaseCost.
> This allows us to provide transforms that operate on field names. Some
> example (using the Java API):
>
> PCollection users = events.apply(Select.fields("user"));  // Select out
> only the user field.
>
> PCollection joinedEvents =
> queries.apply(Join.innerJoin(clicks).byFields("user"));  // Join two
> PCollections by user.
>
> // For each country, calculate the total purchase cost as well as the top
> 10 purchases.
> // A new schema is created containing fields total_cost and top_purchases,
> and rows are created with the aggregation results.
> PCollection purchaseStatistics = events.apply(
> Group.byFieldNames("country")
>.aggregateField("purchaseCost", Sum.ofLongs(),
> "total_cost"))
> .aggregateField("purchaseCost", Top.largestLongs(10),
> "top_purchases"))
>
>
> This is far more readable than what we have today, and what unlocks this
> is that Beam actually knows the structure of the record instead of assuming
> records are uncrackable blobs.
>
> Note that a coder is basically a special case of a schema that has a
> single field.
>
> In BeamJava we have a SchemaRegistry which knows how to turn user types
> into schemas. We use reflection to analyze many user types (e.g. simple
> POJO structs, JavaBean classes, Avro records, protocol buffers, etc.) to
> determine the schema, however this is done only when the graph is initially
> generated. We do use code generation (in Java we do bytecode generation) to
> make this somewhat more efficient. I'm willing to bet that the code
> generator you've written for structs could be very easily modified for
> schemas instead, so it would not be wasted work if we went with schemas.
>
> One of the things I'm working on now is documenting Beam schemas. They are
> already very powerful and useful, but since there is still nothing in our
> documentation about them, they are not yet widely used. I expect to finish
> draft documentation by the end of January.
>
> Reuven
>
> On Thu, Jan 3, 2019 at 11:32 PM Robert Burke  wrote:
>
>> That's an interesting idea. I must confess I don't rightly know the
>> difference between a schema and coder, but here's what I've got with a bit
>> of searching through memory and the mailing list. Please let me know if I'm
>> off track.
>>
>> As near as I can tell, a schema, as far as Beam takes it
>> 
>>  is
>> a mechanism to define what data is extracted from a given row of data. So
>> in principle, there's an opportunity to be more efficient with data with
>> many columns that aren't being used, and only extract the data that's
>> meaningful to the pipeline.
>> The trick then is how to apply the schema to a given serialization
>> format, which is something I'm missing in my mental model (and then how to
>> do it efficiently in Go).
>>
>> I do know that the Go client package for BigQuery
>>  does
>> something like that, using field tags. Similarly, the "encoding/json"
>>  package in the Go
>> Standard Library permits annotating fields and it will read out and
>> deserialize the JSON fields and that's it.
>>
>> A concern I have is that Go (at present) would require pre-compile time
>> code generation for schemas to be efficient, and they would still mostly
>> boil down to turning []bytes into real structs. Go reflection doesn't keep
>> up.
>> Go has no mechanism I'm aware of to Just In Time compile more efficient
>> processing of values.
>> It's also not 100% clear how Schemas would play with protocol buffers or
>> similar.
>> BigQuery has a mechanism of generating a JSON schema from a proto file
>> , but
>> that's only the specification half, not the using half.
>>
>> As it stands, the code generator I've been building these last months
>> could (in principle) statically analyze a user's struct, and then generate
>> an efficient dedicated coder for 

Re: [Go SDK] User Defined Coders

2019-01-03 Thread Reuven Lax
The biggest advantage is actually readability and usability. A secondary
advantage is that it means that Go will be able to interact seamlessly with
BeamSQL, which would be a big win for Go.

A schema is basically a way of saying that a record has a specific set of
(possibly nested, possibly repeated) fields. So for instance let's say that
the user's type is a struct with fields named user, country, purchaseCost.
This allows us to provide transforms that operate on field names. Some
example (using the Java API):

PCollection users = events.apply(Select.fields("user"));  // Select out only the user field.

PCollection joinedEvents =
    queries.apply(Join.innerJoin(clicks).byFields("user"));  // Join two PCollections by user.

// For each country, calculate the total purchase cost as well as the top 10 purchases.
// A new schema is created containing fields total_cost and top_purchases,
// and rows are created with the aggregation results.
PCollection purchaseStatistics = events.apply(
    Group.byFieldNames("country")
        .aggregateField("purchaseCost", Sum.ofLongs(), "total_cost")
        .aggregateField("purchaseCost", Top.largestLongs(10), "top_purchases"));


This is far more readable than what we have today, and what unlocks this is
that Beam actually knows the structure of the record instead of assuming
records are uncrackable blobs.

Note that a coder is basically a special case of a schema that has a single
field.

In BeamJava we have a SchemaRegistry which knows how to turn user types
into schemas. We use reflection to analyze many user types (e.g. simple
POJO structs, JavaBean classes, Avro records, protocol buffers, etc.) to
determine the schema, however this is done only when the graph is initially
generated. We do use code generation (in Java we do bytecode generation) to
make this somewhat more efficient. I'm willing to bet that the code
generator you've written for structs could be very easily modified for
schemas instead, so it would not be wasted work if we went with schemas.

One of the things I'm working on now is documenting Beam schemas. They are
already very powerful and useful, but since there is still nothing in our
documentation about them, they are not yet widely used. I expect to finish
draft documentation by the end of January.

Reuven

On Thu, Jan 3, 2019 at 11:32 PM Robert Burke  wrote:

> That's an interesting idea. I must confess I don't rightly know the
> difference between a schema and coder, but here's what I've got with a bit
> of searching through memory and the mailing list. Please let me know if I'm
> off track.
>
> As near as I can tell, a schema, as far as Beam takes it
> 
>  is
> a mechanism to define what data is extracted from a given row of data. So
> in principle, there's an opportunity to be more efficient with data with
> many columns that aren't being used, and only extract the data that's
> meaningful to the pipeline.
> The trick then is how to apply the schema to a given serialization format,
> which is something I'm missing in my mental model (and then how to do it
> efficiently in Go).
>
> I do know that the Go client package for BigQuery
>  does
> something like that, using field tags. Similarly, the "encoding/json"
>  package in the Go
> Standard Library permits annotating fields and it will read out and
> deserialize the JSON fields and that's it.
>
> A concern I have is that Go (at present) would require pre-compile time
> code generation for schemas to be efficient, and they would still mostly
> boil down to turning []bytes into real structs. Go reflection doesn't keep
> up.
> Go has no mechanism I'm aware of to Just In Time compile more efficient
> processing of values.
> It's also not 100% clear how Schemas would play with protocol buffers or
> similar.
> BigQuery has a mechanism of generating a JSON schema from a proto file
> , but that's
> only the specification half, not the using half.
>
> As it stands, the code generator I've been building these last months
> could (in principle) statically analyze a user's struct, and then generate
> an efficient dedicated coder for it. It just has nowhere to put them such
> that the Go SDK would use it.
>
>
> On Thu, Jan 3, 2019 at 1:39 PM Reuven Lax  wrote:
>
>> I'll make a different suggestion. There's been some chatter that schemas
>> are a better tool than coders, and that in Beam 3.0 we should make schemas
>> the basic semantics instead of coders. Schemas provide everything a coder
>> provides, but also allows for far more readable code. We can't make such a
>> change in Beam Java 2.X for compatibility reasons, but maybe in Go we're
>> better off starting with schemas instead 

Re: Add code quality checks to pre-commits.

2019-01-03 Thread Kenneth Knowles
It would be very useful to have line and/or branch coverage visible. These
are both very weak proxies for quality or reliability, so IMO strict
thresholds are not helpful. One thing that is super useful is to integrate
line coverage into code review, like this:
https://docs.codecov.io/docs/browser-extension. It is very easy to notice
major missing tests.

We have never really used Sonarqube. It was turned on as a possibility in
the early days but never worked on past that point. Could be nice. I
suspect there's a lot to be gained by just finding very low numbers and
improving them. So just running Jacoco's offline HTML generation would do
it (also this integrates with Jenkins). I tried this the other day and
discovered that our gradle config is broken and does not wire tests and
coverage reporting together properly. Last thing: How is "technical debt"
measured? I'm skeptical of quantitative measures for qualitative notions.

Kenn

On Thu, Jan 3, 2019 at 1:58 PM Heejong Lee  wrote:

> I don't have any experience of using SonarQube but Coverity worked well
> for me. Looks like it already has beam repo:
> https://scan.coverity.com/projects/11881
>
> On Thu, Jan 3, 2019 at 1:27 PM Reuven Lax  wrote:
>
>> checkstyle and findbugs are already run as precommit checks, are they not?
>>
>> On Thu, Jan 3, 2019 at 7:19 PM Mikhail Gryzykhin 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> In our current builds we (can) run multiple code quality check tools
>>> like Checkstyle, FindBugs, and code test coverage via Cobertura. However, we
>>> do not utilize many of those signals.
>>>
>>> I suggest adding requirements to code based on those tools.
>>> Specifically, I suggest adding pre-commit checks that will require PRs to
>>> conform to some quality checks.
>>>
>>> We can see a good example of thresholds to add in the Apache SonarQube
>>> provided default quality gate config
>>> :
>>> 80% test coverage on new code,
>>> 5% technical debt on new code,
>>> no bugs/vulnerabilities added.
>>>
>>> As another part of this proposal, I want to suggest the use of SonarQube
>>> for tracking code statistics and as an agent for enforcing code quality
>>> thresholds. It is an Apache-provided tool that has integrations with Jenkins
>>> and Gradle via plugins.
>>>
>>> I believe some reporting to SonarQube was configured for the mvn builds of
>>> some of the Beam sub-projects, but was lost during the migration to Gradle.
>>>
>>> I was looking for other options, but so far found only general configs
>>> for Gradle builds that will fail the build if code coverage for the project
>>> is too low. Such an approach would force us to backfill tests for all
>>> existing code, which can be tedious and demands learning all the legacy code
>>> that might not be part of current work.
>>>
>>> I suggest we discuss and come to a conclusion on two points in this thread:
>>> 1. Do we want to add code quality checks to our pre-commit jobs and
>>> require them to pass before a PR is merged?
>>>
>>> Suggested: Add code quality checks listed above at first, adjust them as
>>> we see fit in the future.
>>>
>>> 2. What tools do we want to utilize for analyzing code quality?
>>>
>>> Under discussion. Suggested: SonarQube, but will depend on functionality
>>> level we want to achieve.
>>>
>>>
>>> Regards,
>>> --Mikhail
>>>
>>


Re: workspace cleanups needed on jenkins master

2019-01-03 Thread Udi Meiri
On Thu, Dec 27, 2018 at 11:02 AM Ismaël Mejía  wrote:

> Bringing this subject for awareness to dev@
> We are sadly part of this top.
> Does somebody know what this data is? And if we can clean it periodically?
> Can somebody with more sysadmin super powers take a look and act on this.
>
> -- Forwarded message -
> From: Chris Lambertus 
> Date: Thu, Dec 27, 2018 at 1:36 AM
> Subject: workspace cleanups needed on jenkins master
> To: 
>
>
> All,
>
> The Jenkins master needs to be cleaned up. Could the following
> projects please reduce your usage significantly by 5 January. After 5
> Jan Infra will be purging more aggressively and updating job
> configurations as needed. As a rule of thumb, we’d like to see
> projects retain no more than 1 week or 7 builds worth of historical
> data at the absolute maximum. Larger projects should retain less to
> avoid using up a disproportionate amount of space on the master.
>
> Some workspaces without any identifiable associated Project will be
> removed.
>
>
>
> 3911 GB .
> 275 GB ./incubator-netbeans-linux
> 270 GB ./pulsar-website-build
> 249 GB ./pulsar-master
> 199 GB ./Packaging
> 127 GB ./HBase
> 121 GB ./PreCommit-ZOOKEEPER-github-pr-build
> 107 GB ./Any23-trunk
> 102 GB ./incubator-netbeans-release
> 79 GB ./incubator-netbeans-linux-experiment
> 77 GB ./beam_PostCommit_Java_PVR_Flink_Batch
>

Wow, this job has huge logs (400MB+).
https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_Java_PVR_Flink_Batch/330/console
A few weeks back I suggested removing the --info flag passed to Gradle.
I haven't done that yet, but it might help reduce the size of the logs
(which I assume are all stored on the master?).

Short-term, we could reduce retention back down to 14 days instead of the
current 30.
+Alan Myrvold  +Mikhail Gryzykhin  we don't need 30 days of retention any
longer for test dashboards, right?


> 76 GB ./HBase-1.3-JDK8
> 70 GB ./Jackrabbit-Oak-Windows
> 70 GB ./stanbol-0.12
> 59 GB ./HBase-Find-Flaky-Tests
> 54 GB ./CouchDB
> 51 GB ./beam_PostCommit_Java_PVR_Flink_Streaming
> 48 GB ./incubator-netbeans-windows
> 47 GB ./FlexJS
> 42 GB ./HBase
> 41 GB ./pulsar-pull-request
> 37 GB ./ZooKeeper_branch35_jdk8
> 32 GB ./HBase-Flaky-Tests
> 32 GB ./Atlas-master-NoTests
> 31 GB ./Atlas-1.0-NoTests
> 31 GB ./beam_PreCommit_Java_Commit
> 30 GB ./Zookeeper_UT_Java18
> 29 GB ./Phoenix-4.x-HBase-1.4
> 28 GB ./HBase-2.0-hadoop3-tests
> 27 GB ./flink-github-ci
> 27 GB ./ZooKeeper_branch35_jdk7
> 27 GB ./oodt-trunk
> 25 GB ./opennlp
> 25 GB ./Trinidad
> 22 GB ./Phoenix-4.x-HBase-1.3
> 21 GB ./ZooKeeper_Flaky_StressTest
> 21 GB ./Atlas-master-AllTests
> 21 GB ./beam_PostCommit_Java_ValidatesRunner_Flink
> 20 GB ./HBase-1.3-JDK7
> 20 GB ./PreCommit-HBASE-Build
> 18 GB ./hadoop-trunk-win
> 18 GB ./HBase-1.2-JDK7
> 18 GB ./HBASE-14070.HLC
> 18 GB ./maven-box
> 17 GB ./Atlas-1.0-AllTests
> 17 GB ./Archiva-TLP-Gitbox
> 17 GB ./Apache
> 17 GB ./Phoenix-5.x-HBase-2.0
> 17 GB ./Phoenix-omid2
> 16 GB ./Lucene-Solr-BadApples-NightlyTests-7.x
> 15 GB ./HBase-2.0
> 14 GB ./flume-trunk
> 14 GB ./beam_PostCommit_Java_ValidatesRunner_Samza
> 14 GB ./HBase-Trunk_matrix
> 13 GB ./commons-csv
> 13 GB ./HBase-Flaky-Tests-old-just-master
> 13 GB ./oodt-coverage
> 12 GB ./incubator-rya-master-with-optionals
> 12 GB ./Syncope-master-deploy
> 11 GB ./PreCommit-PHOENIX-Build
> 11 GB ./Stratos-Master-Nightly-Build
> 11 GB ./Phoenix-master
> 11 GB ./Hadoop-trunk-JACC
> 10 GB ./ctakes-trunk-package
> 10 GB ./FlexJS
> 10 GB ./Atlas-1.0-IntegrationTests
> 9 GB ./incubator-rya-master
> 9 GB ./Atlas-master-IntegrationTests
> 9 GB ./beam_PostCommit_Java_ValidatesRunner_Spark
> 9 GB ./ZooKeeper_UT_Java7
> 9 GB ./Qpid-Broker-J-7.0.x-TestMatrix
> 9 GB ./oodt-dependency-update
> 9 GB ./Apache
> 8 GB ./Struts-examples-JDK8-master
> 8 GB ./Phoenix-4.x-HBase-1.2
> 8 GB ./flume-github-pull-request
> 8 GB ./HBase-HBASE-14614
> 8 GB ./tika-trunk-jdk1.7
> 8 GB ./HBase-1.2-JDK8
> 8 GB ./HBase-1.5
> 7 GB ./Atlas-master-UnitTests
> 7 GB ./tika-2.x-windows
> 7 GB ./incubator-rya-master-with-optionals-pull-requests
> 7 GB ./Hive-trunk
> 7 GB ./beam_PreCommit_Java_Cron
> 7 GB ./Atlas-1.0-UnitTests
> 6 GB ./Jackrabbit
> 6 GB ./beam_PostCommit_Java_PVR_Flink_PR
> 6 GB ./Lucene-Solr-Clover-master
> 6 GB ./Syncope-2_0_X-deploy
> 6 GB ./beam_PostCommit_Java_ValidatesRunner_Apex
> 6 GB ./Tika-trunk
> 6 GB ./pirk
> 6 GB ./Syncope-2_1_X-deploy
> 6 GB ./PLC4X
> 6 GB ./myfaces-current-2.0-integration-tests
> 5 GB ./commons-lang
> 5 GB ./Nemo
> 5 GB ./Mesos-Buildbot
> 5 GB ./Qpid-Broker-J-7.1.x-TestMatrix
> 5 GB ./beam_PostCommit_Java_Nexmark_Flink
> 5 GB ./Qpid-Broker-J-TestMatrix
> 5 GB ./ZooKeeper-Hammer
> 5 GB ./Camel
> 5 GB ./Royale
> 5 GB ./tika-branch-1x
> 5 GB ./ManifoldCF-ant
> 5 GB ./PreCommit-SQOOP-Build
> 5 GB ./HBase-1.4
> 5 GB ./ZooKeeper_UT_Stress
> 4 GB 

Re: Add code quality checks to pre-commits.

2019-01-03 Thread Heejong Lee
I don't have any experience using SonarQube, but Coverity worked well for
me. Looks like it already has a Beam repo:
https://scan.coverity.com/projects/11881

On Thu, Jan 3, 2019 at 1:27 PM Reuven Lax  wrote:

> checkstyle and findbugs are already run as precommit checks, are they not?
>
> On Thu, Jan 3, 2019 at 7:19 PM Mikhail Gryzykhin 
> wrote:
>
>> Hi everyone,
>>
>> In our current builds we (can) run multiple code quality check tools like
>> checkstyle, findbugs, and code test coverage via Cobertura. However, we do
>> not utilize many of those signals.
>>
>> I suggest adding requirements to code based on those tools.
>> Specifically, I suggest adding pre-commit checks that will require PRs to
>> conform to some quality checks.
>>
>> We can see a good example of thresholds to add in the default quality
>> gate config provided by Apache SonarQube:
>> 80% test coverage on new code,
>> 5% technical debt on new code,
>> No bugs/vulnerabilities added.
>>
>> As another part of this proposal, I want to suggest the use of SonarQube
>> for tracking code statistics and as an agent for enforcing code quality
>> thresholds. It is an Apache-provided tool that integrates with Jenkins
>> and Gradle via plugins.
>>
>> I believe some reporting to SonarQube was configured for the mvn builds
>> of some Beam sub-projects, but it was lost during the migration to Gradle.
>>
>> I was looking for other options, but so far I have found only general
>> configs for Gradle builds that fail the build if code coverage for the
>> whole project is too low. Such an approach would force us to backfill
>> tests for all existing code, which can be tedious and demands learning
>> legacy code that might not be part of current work.
>>
>> I suggest we discuss and come to a conclusion on two points in this
>> thread:
>> 1. Do we want to add code quality checks to our pre-commit jobs and
>> require them to pass before a PR is merged?
>>
>> Suggested: add the code quality checks listed above at first, and adjust
>> them as we see fit in the future.
>>
>> 2. What tools do we want to utilize for analyzing code quality?
>>
>> Under discussion. Suggested: SonarQube, but this will depend on the level
>> of functionality we want to achieve.
>>
>>
>> Regards,
>> --Mikhail
>>
>


Re: Why does Beam not use the google-api-client libraries?

2019-01-03 Thread Reuven Lax
Cham is absolutely correct. The google-cloud-pubsub higher-level library
didn't exist when the Beam connectors were written, and nobody has gotten
around to rewriting that connector.
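
For context on the abstraction gap being discussed, here is a minimal,
hedged sketch of publishing with the higher-level google-cloud-pubsub
client (the project and topic names are placeholders, and exact builder
signatures can vary between client versions):

import com.google.api.core.ApiFuture;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PublishSketch {
  public static void main(String[] args) throws Exception {
    // The higher-level client manages channels, batching and retries itself.
    Publisher publisher =
        Publisher.newBuilder(TopicName.of("my-project", "my-topic")).build();
    PubsubMessage message =
        PubsubMessage.newBuilder().setData(ByteString.copyFromUtf8("hello")).build();
    ApiFuture<String> messageId = publisher.publish(message);
    System.out.println("Published message " + messageId.get());
    publisher.shutdown();
  }
}

By contrast, Beam's connector (e.g. PubsubIO.readMessages().fromTopic(...))
builds and manages its own gRPC/JSON Pubsub clients internally, which is
the duplication being asked about here.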

On Wed, Jan 2, 2019 at 11:29 PM Chamikara Jayalath 
wrote:

> Thanks Jeff for the interest in this.
>
> I think most of the existing GCP IO connectors use Google API client
> libraries [1] due to historical reasons (these were the libraries that were
> available when these connectors were originally built).
> We should upgrade to latest Google Cloud client libraries [2] at some
> point but I don't have an exact ETA for this.
>
> Thanks,
> Cham
>
> [1] https://developers.google.com/api-client-library/
> [2] https://cloud.google.com/apis/docs/cloud-client-libraries
>
> On Wed, Jan 2, 2019 at 10:38 AM Jeff Klukas  wrote:
>
>> My apologies. I got the terminology entirely wrong.
>>
>> As you say, PubsubIO and other Beam components _do_ use the official
>> Google API client library (google-api-client). They do not, however, use
>> the higher-level Google Cloud libraries such as google-cloud-pubsub which
>> provide abstractions on top of the API client library.
>>
>> I am wondering whether there are technical reasons not to use the
>> higher-level service-specific libraries, or whether this is simply
>> historical.
>>
>> On Wed, Jan 2, 2019 at 12:38 PM Anton Kedin  wrote:
>>
>>> I don't have enough context to answer all of the questions, but looking
>>> at PubsubIO it seems to use the official libraries, e.g. see Pubsub doc
>>> [1]  vs Pubsub IO GRPC client [2]. Correct me if I misunderstood
>>> your question.
>>>
>>> [1]
>>> https://cloud.google.com/pubsub/docs/publisher#pubsub-publish-message-java
>>> [2]
>>> https://github.com/apache/beam/blob/2e759fecf63d62d110f29265f9438128e3bdc8ab/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubGrpcClient.java#L189
>>>
>>> Pubsub IO JSON client seems to use a slightly different approach but
>>> still relies on a somewhat official path, e.g. Pubsub doc [3] (javadoc [4]) vs
>>> Pubsub IO JSON client [5].
>>>
>>> [3] https://developers.google.com/api-client-library/java/apis/pubsub/v1
>>> [4]
>>> https://developers.google.com/resources/api-libraries/documentation/pubsub/v1/java/latest/com/google/api/services/pubsub/Pubsub.html
>>> [5]
>>> https://github.com/apache/beam/blob/2e759fecf63d62d110f29265f9438128e3bdc8ab/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubJsonClient.java#L130
>>>
>>> The latter seems to be the older library, so I would assume it's for
>>> legacy reasons.
>>>
>>> Regards,
>>> Anton
>>>
>>>
>>> On Wed, Jan 2, 2019 at 9:03 AM Jeff Klukas  wrote:
>>>
 I'm building a high-volume Beam pipeline using PubsubIO and running
 into some concerns over performance and delivery semantics, prompting me to
 want to better understand the implementation. Reading through the library,
 PubsubIO appears to be a completely separate implementation of Pubsub
 client behavior from Google's own Java client. As a developer trying to
 read and understand the implementation, this is a significant hurdle, since
 any previous knowledge of the Google library is not applicable and is
 potentially at odds with what's in PubsubIO.

 Why doesn't beam use the Google clients for PubsubIO, BigQueryIO, etc.?
 Is it for historical reasons? Is there difficulty in packaging and
 integration of the Google clients? Or are the needs for Beam just
 substantially different from what the Google libraries provide?

>>>


Re: [Go SDK] User Defined Coders

2019-01-03 Thread Reuven Lax
I'll make a different suggestion. There's been some chatter that schemas
are a better tool than coders, and that in Beam 3.0 we should make schemas
the basic semantics instead of coders. Schemas provide everything a coder
provides, but also allow for far more readable code. We can't make such a
change in Beam Java 2.X for compatibility reasons, but maybe in Go we're
better off starting with schemas instead of coders?
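
To make the readability point concrete, here is a minimal, hedged sketch
using the Java SDK's schema support (the Purchase type and its fields are
invented, and the exact classes available depend on the SDK version):

import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
import org.apache.beam.sdk.schemas.transforms.Select;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class SchemaSketch {
  // The annotation lets the SDK infer a schema (and therefore an encoding)
  // from the public fields, with no hand-written coder.
  @DefaultSchema(JavaFieldSchema.class)
  public static class Purchase {
    public String userId;
    public double amount;
  }

  // The inferred schema also enables field-level operations by name.
  static PCollection<Row> userIds(PCollection<Purchase> purchases) {
    return purchases.apply(Select.fieldNames("userId"));
  }
}

A Go SDK built around schemas could presumably infer the same information
from exported struct fields.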

Reuven

On Thu, Jan 3, 2019 at 8:45 PM Robert Burke  wrote:

> One area that the Go SDK currently lacks is the ability for users to
> specify their own coders for their types.
>
> I've written a proposal document, and while I'm confident about the core,
> there are certainly some edge cases that require discussion before getting
> on with the implementation.
>
> At present, the SDK only permits primitive value types (all numeric
> types except complex, plus strings and []byte), which are coded with Beam
> coders, and structs whose exported fields are of those types, which are
> then encoded as JSON. Protocol buffer support is hacked in to avoid the
> type analyzer, and represents the current workaround for this issue.
>
> The high-level proposal is to catch up with Python and Java and have a
> coder registry. In addition, arrays and maps should be permitted as well.
>
> If you have alternatives, or other suggestions and opinions, I'd love to
> hear them! Otherwise my intent is to get a PR ready by the end of January.
>
> Thanks!
> Robert Burke
>


Re: Add code quality checks to pre-commits.

2019-01-03 Thread Reuven Lax
checkstyle and findbugs are already run as precommit checks, are they not?

On Thu, Jan 3, 2019 at 7:19 PM Mikhail Gryzykhin  wrote:

> Hi everyone,
>
> In our current builds we (can) run multiple code quality check tools like
> checkstyle, findbugs, and code test coverage via Cobertura. However, we do
> not utilize many of those signals.
>
> I suggest adding requirements to code based on those tools.
> Specifically, I suggest adding pre-commit checks that will require PRs to
> conform to some quality checks.
>
> We can see a good example of thresholds to add in the default quality
> gate config provided by Apache SonarQube:
> 80% test coverage on new code,
> 5% technical debt on new code,
> No bugs/vulnerabilities added.
>
> As another part of this proposal, I want to suggest the use of SonarQube
> for tracking code statistics and as an agent for enforcing code quality
> thresholds. It is an Apache-provided tool that integrates with Jenkins
> and Gradle via plugins.
>
> I believe some reporting to SonarQube was configured for the mvn builds
> of some Beam sub-projects, but it was lost during the migration to Gradle.
>
> I was looking for other options, but so far I have found only general
> configs for Gradle builds that fail the build if code coverage for the
> whole project is too low. Such an approach would force us to backfill
> tests for all existing code, which can be tedious and demands learning
> legacy code that might not be part of current work.
>
> I suggest we discuss and come to a conclusion on two points in this
> thread:
> 1. Do we want to add code quality checks to our pre-commit jobs and
> require them to pass before a PR is merged?
>
> Suggested: add the code quality checks listed above at first, and adjust
> them as we see fit in the future.
>
> 2. What tools do we want to utilize for analyzing code quality?
>
> Under discussion. Suggested: SonarQube, but this will depend on the level
> of functionality we want to achieve.
>
>
> Regards,
> --Mikhail
>


Re: Add code quality checks to pre-commits.

2019-01-03 Thread Scott Wegner
+1 to enabling warnings first. This will allow us to evaluate how much
value we get and the frequency of false-positives / noise.

I'd be curious to hear from those who have worked on other projects about
their experiences with SonarQube or other tools.

At Google, code coverage is integrated into the code review workflow, which
is very valuable for spotting testing gaps during review. Some details
here:
https://testing.googleblog.com/2014/07/measuring-coverage-at-google.html


On Thu, Jan 3, 2019 at 11:34 AM Robert Burke  wrote:

> I had the same question, and it supports many more languages than we do:
> https://www.sonarqube.org/features/multi-languages/
>
> All the various rule checks have clear explanations and justifications
> for why they're doing what they do.
> It would be quite handy as part of the precommits, I think, even if only
> as a warning or similar, until we get the noise down.
>
> On Thu, 3 Jan 2019 at 10:55, Udi Meiri  wrote:
>
>> +1 for adding more code quality signals. Could we add them in an
>> advisory-only mode at first? (a warning and not an error)
>>
>> I'm curious how the "technical debt" metric is determined.
>>
>> I'm not familiar with SonarQube. What languages does it support?
>>
>> On Thu, Jan 3, 2019 at 10:19 AM Mikhail Gryzykhin 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> In our current builds we (can) run multiple code quality check tools like
>>> checkstyle, findbugs, and code test coverage via Cobertura. However, we do
>>> not utilize many of those signals.
>>>
>>> I suggest adding requirements to code based on those tools.
>>> Specifically, I suggest adding pre-commit checks that will require PRs to
>>> conform to some quality checks.
>>>
>>> We can see a good example of thresholds to add in the default quality
>>> gate config provided by Apache SonarQube:
>>> 80% test coverage on new code,
>>> 5% technical debt on new code,
>>> No bugs/vulnerabilities added.
>>>
>>> As another part of this proposal, I want to suggest the use of SonarQube
>>> for tracking code statistics and as an agent for enforcing code quality
>>> thresholds. It is an Apache-provided tool that integrates with Jenkins
>>> and Gradle via plugins.
>>>
>>> I believe some reporting to SonarQube was configured for the mvn builds
>>> of some Beam sub-projects, but it was lost during the migration to Gradle.
>>>
>>> I was looking for other options, but so far I have found only general
>>> configs for Gradle builds that fail the build if code coverage for the
>>> whole project is too low. Such an approach would force us to backfill
>>> tests for all existing code, which can be tedious and demands learning
>>> legacy code that might not be part of current work.
>>>
>>> I suggest we discuss and come to a conclusion on two points in this
>>> thread:
>>> 1. Do we want to add code quality checks to our pre-commit jobs and
>>> require them to pass before a PR is merged?
>>>
>>> Suggested: add the code quality checks listed above at first, and adjust
>>> them as we see fit in the future.
>>>
>>> 2. What tools do we want to utilize for analyzing code quality?
>>>
>>> Under discussion. Suggested: SonarQube, but this will depend on the level
>>> of functionality we want to achieve.
>>>
>>>
>>> Regards,
>>> --Mikhail
>>>
>>

-- 
Got feedback? tinyurl.com/swegner-feedback


Re: Beam Summits!

2019-01-03 Thread Austin Bennett
Hi Matthias, etc,

Trying to get thoughts on formalizing a process for getting proposals
together.  I look forward to the potential day when there are many people
who want (rather than are merely willing) to host a summit in a given
region in a given year.  Perhaps too forward-looking.

Also, you mentioned planning London wound up with a tight time window.  If
we are shooting for April in SF, it seems the clock might be starting to
tick.  Any advice on how much time is needed?  And any guidance on getting
whatever formal approval is needed through Apache - and does this also
necessarily involve a Beam PMC or community vote (probably more related to
the first paragraph)?

Thanks,
Austin

On Thu, Dec 20, 2018, 1:09 AM Matthias Baetens wrote:
> Great stuff, thanks for the overview, Austin.
>
> For EU, there are things to say for both Stockholm and Berlin, but I think
> it makes sense to do it on the back of another conference (larger chance of
> people being in town with the same interest). I like Thomas' comment - we
> will attract more people from the US if we don't let it conflict with the
> big events there. +1 for doing it around the time of Berlin Buzzwords.
>
> For Asia, I'd imagine Singapore would be an option as well. I'll reach out
> to some people that are based there to get a grasp on the size of the
> community there.
>
> Best,
> -M
>
>
>
> On Thu, 20 Dec 2018 at 05:08, Thomas Weise  wrote:
>
>> I think for EU there is a proposal to have it next to Berlin Buzzwords in
>> June. That would provide better spacing and avoid conflict with ApacheCon.
>>
>> Thomas
>>
>>
>> On Wed, Dec 19, 2018 at 3:09 PM Suneel Marthi  wrote:
>>
>>> How about a Beam Summit in Berlin on Sep 6, immediately following Flink
>>> Forward Berlin on the previous 2 days?
>>>
>>> The same may work for Asia as well, following Flink Forward Asia wherever
>>> and whenever it happens.
>>>
>>> On Wed, Dec 19, 2018 at 6:06 PM Austin Bennett <
>>> whatwouldausti...@gmail.com> wrote:
>>>
 Hi All,

 I really enjoyed Beam Summit in London (Thanks Matthias!), and there
 was much enthusiasm for continuations.  We had selected that location in a
 large part due to the growing community there, and we have users in a
 variety of locations.  In our 2019 calendar,
 https://docs.google.com/spreadsheets/d/1CloF63FOKSPM6YIuu8eExjhX6xrIiOp5j4zPbSg3Apo/
 shared in the past weeks, 3 Summits are tentatively slotted for this year.
 I want to start running this by the group to get input.

 * Beam Summit NA, in San Francisco, approx 3 April 2019 (following
 Flink Forward).  I can organize.
 * Beam Summit Europe, in Stockholm, this was the runner up in voting
 falling behind London.  Or perhaps Berlin?  October-ish 2019
 * Beam Summit Asia, in Tokyo ??

 What are general thoughts on locations/dates?

 Looking forward to convening in person soon.

 Cheers,
 Austin

>>>


[Go SDK] User Defined Coders

2019-01-03 Thread Robert Burke
One area that the Go SDK currently lacks is the ability for users to
specify their own coders for their types.

I've written a proposal document, and while I'm confident about the core,
there are certainly some edge cases that require discussion before getting
on with the implementation.

At present, the SDK only permits primitive value types (all numeric types
except complex, plus strings and []byte), which are coded with Beam coders,
and structs whose exported fields are of those types, which are then
encoded as JSON. Protocol buffer support is hacked in to avoid the type
analyzer, and represents the current workaround for this issue.

The high-level proposal is to catch up with Python and Java and have a
coder registry. In addition, arrays and maps should be permitted as well.
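
For reference, a minimal, hedged sketch of the existing Java SDK coder
registry that the Go registry would presumably mirror (MyEvent and its
single-field wire format are invented for illustration):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CoderRegistrySketch {
  static class MyEvent {
    String payload;
  }

  static class MyEventCoder extends CustomCoder<MyEvent> {
    @Override
    public void encode(MyEvent value, OutputStream outStream) throws IOException {
      // Delegate to a built-in coder for the single field.
      StringUtf8Coder.of().encode(value.payload, outStream);
    }

    @Override
    public MyEvent decode(InputStream inStream) throws IOException {
      MyEvent event = new MyEvent();
      event.payload = StringUtf8Coder.of().decode(inStream);
      return event;
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // Registering the coder makes it the default for any PCollection<MyEvent>.
    p.getCoderRegistry().registerCoderForClass(MyEvent.class, new MyEventCoder());
  }
}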

If you have alternatives, or other suggestions and opinions, I'd love to
hear them! Otherwise my intent is to get a PR ready by the end of January.

Thanks!
Robert Burke


Re: Add code quality checks to pre-commits.

2019-01-03 Thread Robert Burke
I had the same question, and it supports many more languages than we do:
https://www.sonarqube.org/features/multi-languages/

All the various rule checks have clear explanations and justifications
for why they're doing what they do.
It would be quite handy as part of the precommits, I think, even if only as
a warning or similar, until we get the noise down.

On Thu, 3 Jan 2019 at 10:55, Udi Meiri  wrote:

> +1 for adding more code quality signals. Could we add them in an
> advisory-only mode at first? (a warning and not an error)
>
> I'm curious how the "technical debt" metric is determined.
>
> I'm not familiar with SonarQube. What languages does it support?
>
> On Thu, Jan 3, 2019 at 10:19 AM Mikhail Gryzykhin 
> wrote:
>
>> Hi everyone,
>>
>> In our current builds we (can) run multiple code quality check tools like
>> checkstyle, findbugs, and code test coverage via Cobertura. However, we do
>> not utilize many of those signals.
>>
>> I suggest adding requirements to code based on those tools.
>> Specifically, I suggest adding pre-commit checks that will require PRs to
>> conform to some quality checks.
>>
>> We can see a good example of thresholds to add in the default quality
>> gate config provided by Apache SonarQube:
>> 80% test coverage on new code,
>> 5% technical debt on new code,
>> No bugs/vulnerabilities added.
>>
>> As another part of this proposal, I want to suggest the use of SonarQube
>> for tracking code statistics and as an agent for enforcing code quality
>> thresholds. It is an Apache-provided tool that integrates with Jenkins
>> and Gradle via plugins.
>>
>> I believe some reporting to SonarQube was configured for the mvn builds
>> of some Beam sub-projects, but it was lost during the migration to Gradle.
>>
>> I was looking for other options, but so far I have found only general
>> configs for Gradle builds that fail the build if code coverage for the
>> whole project is too low. Such an approach would force us to backfill
>> tests for all existing code, which can be tedious and demands learning
>> legacy code that might not be part of current work.
>>
>> I suggest we discuss and come to a conclusion on two points in this
>> thread:
>> 1. Do we want to add code quality checks to our pre-commit jobs and
>> require them to pass before a PR is merged?
>>
>> Suggested: add the code quality checks listed above at first, and adjust
>> them as we see fit in the future.
>>
>> 2. What tools do we want to utilize for analyzing code quality?
>>
>> Under discussion. Suggested: SonarQube, but this will depend on the level
>> of functionality we want to achieve.
>>
>>
>> Regards,
>> --Mikhail
>>
>


Re: Add code quality checks to pre-commits.

2019-01-03 Thread Udi Meiri
+1 for adding more code quality signals. Could we add them in an
advisory-only mode at first? (a warning and not an error)

I'm curious how the "technical debt" metric is determined.

I'm not familiar with SonarQube. What languages does it support?

On Thu, Jan 3, 2019 at 10:19 AM Mikhail Gryzykhin  wrote:

> Hi everyone,
>
> In our current builds we (can) run multiple code quality check tools like
> checkstyle, findbugs, and code test coverage via Cobertura. However, we do
> not utilize many of those signals.
>
> I suggest adding requirements to code based on those tools.
> Specifically, I suggest adding pre-commit checks that will require PRs to
> conform to some quality checks.
>
> We can see a good example of thresholds to add in the default quality
> gate config provided by Apache SonarQube:
> 80% test coverage on new code,
> 5% technical debt on new code,
> No bugs/vulnerabilities added.
>
> As another part of this proposal, I want to suggest the use of SonarQube
> for tracking code statistics and as an agent for enforcing code quality
> thresholds. It is an Apache-provided tool that integrates with Jenkins
> and Gradle via plugins.
>
> I believe some reporting to SonarQube was configured for the mvn builds
> of some Beam sub-projects, but it was lost during the migration to Gradle.
>
> I was looking for other options, but so far I have found only general
> configs for Gradle builds that fail the build if code coverage for the
> whole project is too low. Such an approach would force us to backfill
> tests for all existing code, which can be tedious and demands learning
> legacy code that might not be part of current work.
>
> I suggest we discuss and come to a conclusion on two points in this
> thread:
> 1. Do we want to add code quality checks to our pre-commit jobs and
> require them to pass before a PR is merged?
>
> Suggested: add the code quality checks listed above at first, and adjust
> them as we see fit in the future.
>
> 2. What tools do we want to utilize for analyzing code quality?
>
> Under discussion. Suggested: SonarQube, but this will depend on the level
> of functionality we want to achieve.
>
>
> Regards,
> --Mikhail
>




Add code quality checks to pre-commits.

2019-01-03 Thread Mikhail Gryzykhin
Hi everyone,

In our current builds we (can) run multiple code quality check tools like
checkstyle, findbugs, and code test coverage via Cobertura. However, we do
not utilize many of those signals.

I suggest adding requirements to code based on those tools. Specifically,
I suggest adding pre-commit checks that will require PRs to conform to
some quality checks.

We can see a good example of thresholds to add in the default quality
gate config provided by Apache SonarQube:
80% test coverage on new code,
5% technical debt on new code,
No bugs/vulnerabilities added.

As another part of this proposal, I want to suggest the use of SonarQube
for tracking code statistics and as an agent for enforcing code quality
thresholds. It is an Apache-provided tool that integrates with Jenkins
and Gradle via plugins.

I believe some reporting to SonarQube was configured for the mvn builds
of some Beam sub-projects, but it was lost during the migration to Gradle.

I was looking for other options, but so far I have found only general
configs for Gradle builds that fail the build if code coverage for the
whole project is too low. Such an approach would force us to backfill
tests for all existing code, which can be tedious and demands learning
legacy code that might not be part of current work.

I suggest we discuss and come to a conclusion on two points in this
thread:
1. Do we want to add code quality checks to our pre-commit jobs and
require them to pass before a PR is merged?

Suggested: add the code quality checks listed above at first, and adjust
them as we see fit in the future.

2. What tools do we want to utilize for analyzing code quality?

Under discussion. Suggested: SonarQube, but this will depend on the level
of functionality we want to achieve.


Regards,
--Mikhail


Re: [PROPOSAL] Prepare Beam 2.10.0 release

2019-01-03 Thread Maximilian Michels
Thanks for driving this, Kenn! I'm in favor of a strict cut-off, but I'd like to 
propose a week for cherry-picking relevant changes to the release branch. It 
looks like many people are returning from holidays or are still off.


Cheers,
Max

On 02.01.19 17:20, Kenneth Knowles wrote:

Done. I've created the Jira tag for 2.11.0.

Previously, there were a few days' warning to get things in before the branch is 
cut. You can just cherry-pick them. This is a bit better for release stability 
by avoiding all the other changes on master. The timing of the cut is always 
going to include older and newer changes anyhow.


Kenn

On Wed, Jan 2, 2019 at 1:08 PM Ismaël Mejía > wrote:


Can you please create the 2.11 tag in JIRA so we can move the JIRAs that
are not blocking? I have quite a bunch of pending code reviews that I
hoped to get into this one, but now they will probably have to wait. (My
apologies to the people who may be impacted; I had not checked that the
date was in the first week.)

On Wed, Jan 2, 2019 at 4:45 PM Jean-Baptiste Onofré wrote:
 >
 > It sounds good to me.
 >
 > Regards
 > JB
 >
 > On 02/01/2019 16:16, Kenneth Knowles wrote:
 > > Hi All,
 > >
 > > According to the release calendar [1] branch cut date for
 > > Beam 2.10.0 release is today, 2019 January 2. I'd like to volunteer to
 > > manage this release. Does anyone have any reason we should not release
 > > on schedule?
 > >
 > > Otherwise, if you know of release-blocking bugs, please mark their "Fix
 > > Version" as 2.10.0 and they will show up in the burndown [2]. If you own
 > > a bug currently in the burndown, please double-check if it is truly
 > > release-blocking.
 > >
 > > I've gone ahead and cut a release-2.10.0 branch from the current master,
 > > since it is green. If it turns out to be an inauspicious starting point,
 > > we can always reset it.
 > >
 > > Kenn
 > >
 > > [1]
 > > https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
 > > [2] https://issues.apache.org/jira/projects/BEAM/versions/12344540
 >
 > --
 > Jean-Baptiste Onofré
 > jbono...@apache.org 
 > http://blog.nanthrax.net
 > Talend - http://www.talend.com



Re: Beam snapshots broken

2019-01-03 Thread Alexey Romanenko
Thank you for fixing this, Scott. I also can confirm that it works for me now.

> On 2 Jan 2019, at 22:36, Ismaël Mejía  wrote:
> 
> Thanks Scott, I just tested and the errors I found are fixed and
> everything seems to work now. Snapshots are being correctly updated
> too.
> 
> On Wed, Jan 2, 2019 at 6:19 PM Scott Wegner  wrote:
>> 
>> Yes, I believe this will be fixed with the vendoring release.
>> 
>> I am back from the holidays now and ready to pick this up.
>> 
>> On Thu, Dec 27, 2018 at 10:56 AM Ismaël Mejía  wrote:
>>> 
>>> Thanks Andrew for the info.
>>> I think things can wait a bit considering the time of the year, I just
>>> wanted to raise awareness about the issue.
>>> I suppose that we can wait; it should not be long before the vendoring
>>> release is done, and that will fix this if I understood correctly.
>>> If anyone else is blocked on this, please contact us and we will revert
>>> it; otherwise I suppose we can do this revert locally (as Gleb
>>> mentioned) in the meantime.
>>> 
>>> On Thu, Dec 27, 2018 at 6:24 PM Andrew Pilloud  wrote:
 
 https://issues.apache.org/jira/browse/BEAM-6282
 
 Kenn (and everyone else who has context on this change) are out this week, 
 so I don't think anyone is making progress on it. Is this something that 
 can wait a week or two? If not we should revert 
 https://github.com/apache/beam/pull/7324
 
 Andrew
 
 On Thu, Dec 27, 2018 at 9:07 AM Gleb Kanterov  wrote:
> 
> I can reproduce this on my machine, and reverting 
> https://github.com/apache/beam/pull/7324 fixed the problem. There is a 
> separate thread in dev@ about releasing vendored gRPC v0.2; I'm wondering 
> if it will fix this issue.
> 
> On Thu, Dec 27, 2018 at 5:20 PM Ismaël Mejía  wrote:
>> 
>> Looks like snapshots are broken again since 20 December; can somebody 
>> PTAL?
>> Seems like some part of the vendoring could be related to this failure
>> (maybe it is looking for the unpublished version)?
>> 
>> Running some tests in one existing application I found this
>> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time
>> elapsed: 0.447 s <<< FAILURE! - in SerializationTest
>> [ERROR] nonSerilizableTest(SerializationTest)  Time elapsed: 0.028 s  
>> <<< ERROR!
>> java.lang.NoClassDefFoundError:
>> org/apache/beam/vendor/grpc/v1p13p1/com/google/protobuf/ProtocolMessageEnum
>>at SerializationTest.nonSerilizableTest(SerializationTest.java:27)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.beam.vendor.grpc.v1p13p1.com.google.protobuf.ProtocolMessageEnum
>>at SerializationTest.nonSerilizableTest(SerializationTest.java:27)
>> 
>> On Thu, Dec 13, 2018 at 9:13 AM Mark Liu  wrote:
>>> 
>>> Looks like the recent failure (like this job) is related to the 
>>> ':beam-sdks-python:test' change introduced in this PR. `./gradlew 
>>> :beam-sdks-python:test` can reproduce the error.
>>> 
>>> Testing a fix in PR7273.
>>> 
>>> On Wed, Dec 12, 2018 at 8:31 AM Yifan Zou  wrote:
 
 Beam9 is offline right now. But the job also failed on beam4 and beam13 
 with "Could not determine the dependencies of task ':beam-sdks-python:test'." 
 Seems like the task dependency was not set up properly.
 
 
 
 On Wed, Dec 12, 2018 at 2:03 AM Ismaël Mejía  wrote:
> 
> You are right, it seems that it was related to beam9 (wondering if it
> was bad luck that it was always assigned to beam9, or whether we can
> improve that poor balancing).
> However, it failed again today against beam13; maybe this time it is just
> a build issue, but it seems related to Python too.
> 
> On Tue, Dec 11, 2018 at 7:33 PM Boyuan Zhang  
> wrote:
>> 
>> It seems that the failed jobs are not all due to a single task 
>> failure. The failed tasks were executed on beam9, which was 
>> rebooted yesterday because Python tests failed continuously. +Yifan 
>> Zou may have more useful context here.
>> 
>> On Tue, Dec 11, 2018 at 9:10 AM Ismaël Mejía  
>> wrote:
>>> 
>>> It seems that Beam snapshots are broken since Dec. 2
>>> https://builds.apache.org/view/A-D/view/Beam/job/beam_Release_Gradle_NightlySnapshot/
>>> 
>>> It seems "The :beam-website:startDockerContainer task failed."
>>> Can somebody please take a look.
> 
> 
> 
> --
> Cheers,
> Gleb
>> 
>> 
>> 
>> --
>> Got feedback? tinyurl.com/swegner-feedback