Re: Performance tests status and anomaly detection proposal

2018-04-14 Thread Tim Robertson
Very nice, Dariusz.

+1 to putting the daily report on Slack to help prompt early comments and
investigation when things change.

A couple of very minor comments:

a) Would it make sense to track the number of tests run as well as duration?
b) Labels on the Y-axis / a title with units would be a good addition to
the charts (sorry, a bugbear of mine)

I've recently added retry-on-failure code to SolrIO, and the tests required
doing things like verifying that the timeout is observed. This increased the
test runtime, and your alerting of all devs would help communicate that and
hopefully prompt folks to raise any concerns early.
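
For illustration, a timeout-observing retry test of the kind described might
look roughly like this minimal sketch (the retry helper below is
hypothetical, not the actual SolrIO code):

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    import java.util.function.BooleanSupplier;
    import org.junit.Test;

    public class RetryTimeoutTest {

      /** Minimal retry loop that gives up once the deadline passes. */
      static boolean retryUntil(long timeoutMillis, BooleanSupplier attempt)
          throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
          if (attempt.getAsBoolean()) {
            return true;
          }
          Thread.sleep(50); // fixed backoff between attempts, for simplicity
        }
        return false;
      }

      @Test
      public void givesUpOnceTimeoutElapses() throws InterruptedException {
        long start = System.currentTimeMillis();
        boolean succeeded = retryUntil(500, () -> false); // attempts always fail
        long elapsed = System.currentTimeMillis() - start;
        assertFalse(succeeded);
        assertTrue(elapsed >= 500);  // the timeout was actually waited out...
        assertTrue(elapsed < 2000);  // ...but not wildly exceeded
      }
    }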

Thanks for sharing,
Tim



Re: Performance tests status and anomaly detection proposal

2018-04-14 Thread Kenneth Knowles
This is very cool. So is it easy for someone to extend the proposal to
regularly run Nexmark benchmarks and get those on the dashboard? (Or a
separate one, to keep the IOs on their own page.)

Kenn

On Fri, Apr 13, 2018 at 9:02 AM Dariusz Aniszewski <
dariusz.aniszew...@polidea.com> wrote:

> Hello Beam devs!
>
> As you might have already noticed, together with Łukasz Gajowy, Kamil
> Szewczyk and Katarzyna Kucharczyk (all directly cc’d here), we’re working
> on adding some performance tests to the project. We followed the
> directions from the Testing I/O Transforms in Apache Beam site (which we
> plan to update in the near future).
>
> We started by testing various FileBasedIOs as part of BEAM-3060. So far
> we have tests for:
> - TextIO (with and without compression)
> - AvroIO
> - XmlIO
> - TFRecordIO
> which may run on the following filesystems:
> - local
> - GCS
> - HDFS (except for TFRecordIO, see BEAM-3945)
>
> Besides FileBasedIOs, we also covered:
> - HadoopInputFormatIO
> - MongoDBIO
> - JdbcIO (the test already existed but was disabled; we fixed and
>   enabled it)
> - HCatalogIO (currently in PR)
>
> While all the tests are currently Maven-based, we responded to the
> ongoing Gradle migration and created a PR that allows running them via
> Gradle.
>
> All of these tests are executed on a daily basis on Apache Jenkins, and
> their results are published to individual BigQuery tables. There is also
> a dashboard on which test results may be viewed and compared:
> https://apache-beam-testing.appspot.com/explore?dashboard=5755685136498688
>
> As we already have some amount of tests, we’re currently working on a
> tool that will analyze the results and search for anomalies, so devs are
> notified if degraded performance is observed. You can find the proposal
> document here:
> https://docs.google.com/document/d/1Cb7XVmqe__nA_WCrriAifL-3WCzbZzV4Am5W_SkQLeA
>
> We welcome you to share your thoughts on performance tests in general,
> as well as on the proposed solution for anomaly detection.
>
> Best,
> Dariusz Aniszewski
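
For illustration only, the kind of check such an anomaly-detection tool
might perform could look roughly like this minimal sketch (the class name,
threshold, and heuristic below are hypothetical, not taken from the
proposal document):

    import java.util.Arrays;
    import java.util.List;

    public class RuntimeAnomalyCheck {
      // Flags the latest run as anomalous if it is slower than the trailing
      // mean of previous runs by more than the given factor (e.g. 1.5 = 50%
      // slower). A real tool would pull these values from BigQuery.
      static boolean isAnomalous(List<Double> history, double latest, double factor) {
        double mean = history.stream()
            .mapToDouble(Double::doubleValue)
            .average()
            .orElse(latest); // no history yet: nothing to compare against
        return latest > mean * factor;
      }

      public static void main(String[] args) {
        List<Double> history = Arrays.asList(100.0, 104.0, 98.0, 102.0);
        System.out.println(isAnomalous(history, 160.0, 1.5)); // true: regression
        System.out.println(isAnomalous(history, 105.0, 1.5)); // false: within noise
      }
    }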


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-14 Thread Kenneth Knowles
One reason I resist the user/system distinction is that Beam is a
multi-party system with at least SDK, runner, and pipeline. Often there may
be a DSL like SQL or Scio, or similarly someone may be building a platform
for their company where there is no user authoring the pipeline. Should
Scio, SQL, or MyCompanyFramework metrics end up in "user"? Who decides to
tack on the prefix? It looks like it is the SDK harness? Are there just
three namespaces "runner", "sdk", and "user"?  Most of what you'd think of
as "user" versus "system" should simply be the difference between
dynamically defined & typed metrics and fields in control plane protos. If
that layer of the namespaces is not finite and limited, who can make
a valid extension? Just some questions that I think would flesh out the
meaning of the "user" prefix.

Kenn

On Fri, Apr 13, 2018 at 5:26 PM Andrea Foegler  wrote:

>
>
> On Fri, Apr 13, 2018 at 5:00 PM Robert Bradshaw 
> wrote:
>
>> On Fri, Apr 13, 2018 at 3:28 PM Andrea Foegler 
>> wrote:
>>
>>> Thanks, Robert!
>>>
>>> I think my lack of clarity is around the MetricSpec.  Maybe what's in my
>>> head and what's being proposed are the same thing.  When I read that the
>>> MetricSpec describes the proto structure, that sounds kind of complicated to
>>> me.  But I may be misinterpreting it.  What I picture is something like a
>>> MetricSpec that looks like (note: my picture looks a lot like Stackdriver
>>> :):
>>>
>>> {
>>> name: "my_timer"
>>>
>>
>> name: "beam:metric:user:my_namespace:my_timer" (assuming we want to keep
>> requiring namespaces). Or "beam:metric:[some non-user designation]"
>>
>
> Sure. Looks good.
>
>
>>
>> labels: { "ptransform" }
>>>
>>
>> How does an SDK act on this information?
>>
>
> The SDK is obligated to submit any metric values for that spec with a
> "ptransform" -> "transformName" in the labels field.  Autogenerating code
> from the spec to avoid typos should be easy.
>
>
>>
>>
>>> type: GAUGE
>>> value_type: int64
>>>
>>
>> I was lumping type and value_type into the same field, as a urn for
>> possibly extensibility, as they're tightly coupled (e.g. quantiles,
>> distributions).
>>
>
> My inclination is that keeping this set relatively small and fixed to a
> set that can be readily exported to external monitoring systems is more
> useful than the added indirection to support extensibility.  Lumping
> together seems reasonable.
>
>
>>
>>
>>> units: SECONDS
>>> description: "Times my stuff"
>>>
>>
>> Are both of these optional metadata, in the form of key-value fields, or
>> flattened into the field itself (along with every other kind of metadata
>> you may want to attach)?
>>
>
> Optional metadata in the form of fixed fields.  Is there a use case for
> arbitrary metadata?  What would you do with it when exporting?
>
>
>>
>>
>>> }
>>>
>>> Then metrics submitted would look like:
>>> {
>>> name: "my_timer"
>>> labels: {"ptransform": "MyTransform"}
>>> int_value: 100
>>> }
>>>
>>
>> Yes, or value could be a bytes field that is encoded according to
>> [value_]type above, if we want that extensibility (e.g. if we want to
>> bundle the pardo sub-timings together, we'd need a proto for the value, but
>> that seems too specific to hard-code into the basic structure).
>>
>>
>>> The simplicity comes from the fact that there's only one proto format
>>> for the spec and for the value. The only thing that varies are the
>>> entries in the map and the value field set. It's pretty easy to
>>> establish contracts around this type of spec and even generate protos
>>> for use in the SDK that make the expectations explicit.
>>>
>>>
>>> On Fri, Apr 13, 2018 at 2:23 PM Robert Bradshaw 
>>> wrote:
>>>
 On Fri, Apr 13, 2018 at 1:32 PM Kenneth Knowles  wrote:

>
> Or just "beam:counter::" or even
> "beam:metric::" since metrics have a type separate from
> their name.
>

 I proposed keeping the "user" in there to avoid possible clashes with
 the system namespaces. (No preference on counter vs. metric, I wasn't
 trying to imply counter = SumInts)


 On Fri, Apr 13, 2018 at 2:02 PM Andrea Foegler 
 wrote:

> I like the generalization from entity -> labels.  I view the purpose
> of those fields as providing context.  And labels feel like they support a
> richer set of contexts.
>

 If we think such a generalization provides value, I'm fine with doing
 that now, as sets or key-value maps, if we have good enough examples to
 justify this.


> The URN concept gets a little tricky.  I totally agree that the
> context fields should not be embedded in the name.
> There's a "name" which is the identifier that can be used to
> communicate what context values are supported / allowed for metrics with
> that name (for example, element_count expects a ptransform ID).  But then
> there's the context.  In Stackdriver, this context is a map of key-value
> pairs; the type is consider
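
For concreteness, the spec-plus-labels shape being discussed might map to
something like the following minimal Java sketch (all class and field names
here are illustrative stand-ins, not the actual Fn API protos):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MetricSketch {

      /** A registered spec: a urn plus the label keys the SDK must populate. */
      static class MetricSpec {
        final String urn;          // e.g. "beam:metric:user:my_namespace:my_timer"
        final List<String> labels; // required label keys, e.g. "ptransform"
        MetricSpec(String urn, List<String> labels) {
          this.urn = urn;
          this.labels = labels;
        }
      }

      /** A reported value: the spec's urn, the populated labels, and a payload. */
      static class MetricUpdate {
        final String urn;
        final Map<String, String> labels;
        final long intValue;
        MetricUpdate(String urn, Map<String, String> labels, long intValue) {
          this.urn = urn;
          this.labels = labels;
          this.intValue = intValue;
        }
      }

      public static void main(String[] args) {
        MetricSpec spec = new MetricSpec(
            "beam:metric:user:my_namespace:my_timer", Arrays.asList("ptransform"));
        // Per the discussion, the SDK is obligated to fill in every label key
        // the spec declares before submitting a value.
        Map<String, String> labels = new HashMap<>();
        labels.put("ptransform", "MyTransform");
        MetricUpdate update = new MetricUpdate(spec.urn, labels, 100);
        System.out.println(update.urn + " " + update.labels + " = " + update.intValue);
      }
    }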

Re: Gradle Status [April 6]

2018-04-14 Thread Kenneth Knowles
You shouldn't do :module:cleanTest. If that is necessary, that's a major bug
in the build.

Kenn

On Fri, Apr 13, 2018 at 11:46 PM Romain Manni-Bucau 
wrote:

> There is a fake module xxx_test which should have the right classpath, but
> since IDEA compilation is messed up you will still have to run
> :module:cleanTest :module:test --tests org...MyTest.myMethod, even with
> IDEA, which leads to the same latency as the command line :(
>
> On 13 Apr 2018 at 22:23, "Eugene Kirpichov" wrote:
>
>> While we're talking about running tests in IntelliJ with Gradle...
>> Anybody got advice on how to run a single NeedsRunner test in
>> sdks-java-core, say, ParDoTest? With Maven, I used to just run the test in
>> IntelliJ and specify "runners-direct-java" as the classpath; with Gradle,
>> the best I've found is to manually run the direct runner's needsRunnerTests
>> task, specifying --tests=..., but it takes a long time, and IntelliJ treats
>> that as just a Gradle task run, not as a test run.
>>
>> On Fri, Apr 13, 2018 at 11:14 AM Reuven Lax  wrote:
>>
>>> Is there a Jira for this 3 second delay? Also, your initial complaint
>>> was not about the 3 second delay, so it wasn't clear that's what you were
>>> complaining about.
>>>
>>> Reuven
>>>
>>> On Fri, Apr 13, 2018 at 4:42 AM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 When you launch a test with the Gradle runner it launches Gradle, which
 loses you 3s on a very fast computer and more on a slower one (6s on my
 personal machine, which is already fast but not as much as my work one).
 At least five of us have seen that regression. So there is a reason not to
 use the Gradle runner if possible, because when you work and need to debug
 you are just stuck (that is why I switched back to mvn after 15 min; I was
 losing too much time).

 Switching back to the native IDEA test run would fix it, but tests just
 don't work this way for me whatever setup I do :( - missing resources,
 IIRC, in the IDEA out dir.

 On 13 Apr 2018 at 00:07, "Reuven Lax" wrote:

 I also don't quite understand what your question is, and it appears that
 Dan spent considerable time trying to reproduce your issue. For the
 record, I have had no issues running tests via Gradle in IntelliJ for the
 past few weeks.

 Reuven

 On Thu, Apr 12, 2018 at 9:47 PM Daniel Oliveira 
 wrote:

> Sorry Romain, I'm not quite sure what you're asking. Can you clarify?
>
> On Thu, Apr 12, 2018 at 12:22 PM Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>> Well, you are the only one not to have the drawbacks of using it, so
>> maybe don't do it? I know Luke is on holiday, but is there anyone else who
>> knows why we need that noise compared to the IDEA native tooling/flow?
>>
>> On 12 Apr 2018 at 20:16, "Daniel Oliveira" wrote:
>>
>>> Ah, I did not. Thanks Romain.
>>>
>>> I tried it again, restarting in between, and still had no differences.
>>> Since it seems like there's no reason not to use "Gradle Test Runner",
>>> I'll mention it in the contributor's guide.
>>>
>>> On Thu, Apr 12, 2018 at 10:31 AM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 @Daniel: did you restart in between? Otherwise it does nothing. One
 launches JUnitCoreRunner from IDEA and the other a Gradle command.

 On 12 Apr 2018 at 19:24, "Daniel Oliveira" wrote:

> I think it depends on what exactly switching to "Gradle Test Runner"
> from "Platform Test Runner" does. I tried it out on my machine and they
> seem to act identically to each other. The IntelliJ documentation says
> it determines what API to use to run the tests, so maybe its usefulness
> depends on the user's machine, in which case a note about that would be
> useful. Something like: "If your IDE has trouble running tests via IDEA
> shortcuts, try the following steps: [...]"
>
> On Thu, Apr 12, 2018 at 3:29 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> Daniel, actually I did run it with the default IDEA JUnit test runner.
>> Then, in "Settings > Build, Execution, Deployment > Build Tools >
>> Gradle > Runner" I selected "Gradle Test Runner" in the "Run tests using"
>> selectbox, and it works OK when I run my tests with IDEA shortcuts. So,
>> probably, we should add these details on
>> https://beam.apache.org/contribute/intellij/ too.
>> What do you think?
>>
>> WBR,
>> Alexey
>>
>> On 11 Apr 2018, at 21:17, Daniel Oliveira 
>> wrote:
>>
>> Alexey, are you referring to tests run with "./gradlew
>> :bea
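
For reference, the single-test invocation Eugene describes might look
something like the following (the module path here is an assumption based
on the names in this thread; --tests is Gradle's standard test filter):

    ./gradlew :beam-runners-direct-java:needsRunnerTests --tests org.apache.beam.sdk.transforms.ParDoTest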

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-04-14 Thread Eugene Kirpichov
Hi all,

The video is now available. I got it from my Strata account and I have
permission to use and share it freely, so I published it on my own YouTube
page (where there's nothing else...). Perhaps it makes sense to add it to
the Beam YouTube channel, but AFAIK only a PMC member can do that.

https://www.youtube.com/watch?v=NIn9E5TVoCA


On Tue, Mar 13, 2018 at 3:33 AM James  wrote:

> Very informative, thanks!
>
> On Fri, Mar 9, 2018 at 4:49 PM Etienne Chauchot 
> wrote:
>
>> Great !
>>
>> Thanks for sharing.
>>
>> Etienne
>>
>> On Thursday, 8 March 2018 at 19:49, Eugene Kirpichov wrote:
>>
>> Hey all,
>>
>> The slides for my talk yesterday at Strata San Jose
>> https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696
>> have been posted on the talk page. They may be of interest both to users
>> and to IO authors.
>>
>> Thanks.
>>
>>


Build failed in Jenkins: beam_Release_Gradle_NightlySnapshot #5

2018-04-14 Thread Apache Jenkins Server
See 


Changes:

[sidhom] Support impulse transforms in Flink

[sidhom] Add Impulse ValidatesRunner test

[tgroh] Fix materializesWithDifferentEnvConsumer

[tgroh] Reduce Requirements to be considered a Primitve

[ehudm] Add lint checks for modules under sdks/python/.

[kedin] Add Row Json Deserializer

[kedin] Add RowJsonValueExtractors

[aaltay] [BEAM-4028] Adding NameContext to Python SDK. (#5043)

[sidhom] [BEAM-3994] Use typed client pool sinks and sources

[sidhom] [BEAM-3966] Move functional utilities into shared module

[sidhom] Use general functional interfaces in ControlClientPool

[sidhom] Rename createLinked() to createBuffered() in QueueControlClientPool

[github] Add region argument to dataflow.go

[github] Region isn't on proto; create and get instead.

[tgroh] Rename `defaultRegistry` to `javaSdkNativeRegistry`

[altay] Cythonize DistributionAccumulator

--
[...truncated 3.58 MB...]
Appending inputPropertyHash for 'sourceCompatibility' to build cache key: 12f0ae79f405c46e9045f83b66543728
Appending inputPropertyHash for 'targetCompatibility' to build cache key: 12f0ae79f405c46e9045f83b66543728
Appending inputPropertyHash for 'toolChain.class' to build cache key: 1215759b7e13ccf2e05e4874b683b415
Appending inputPropertyHash for 'toolChain.version' to build cache key: 38f0f60cac62a95635fccb98efcdca07
Appending inputPropertyHash for 'aptOptions.processorpath' to build cache key: 051963b0442399565bfac889fe6a9239
Appending inputPropertyHash for 'classpath' to build cache key: 1aa83621e0511d503763cbf33251a046
Appending inputPropertyHash for 'effectiveAnnotationProcessorPath' to build cache key: a2e4c852d5efc8968bb1b8fe4bbad192
Appending inputPropertyHash for 'options.bootstrapClasspath' to build cache key: d41d8cd98f00b204e9800998ecf8427e
Appending inputPropertyHash for 'options.sourcepath' to build cache key: d41d8cd98f00b204e9800998ecf8427e
Appending inputPropertyHash for 'source' to build cache key: 2410aa9567048636443291fcdca5286b
Appending outputPropertyName to build cache key: destinationDir
Appending outputPropertyName to build cache key: generatedSourcesDestinationDir
Build cache key for task ':beam-sdks-java-nexmark:compileTestJava' is 747fce101e6e245a06e066f623271115
Task ':beam-sdks-java-nexmark:compileTestJava' is not up-to-date because:
  Executed with '--rerun-tasks'.
All input files are considered out-of-date for incremental task ':beam-sdks-java-nexmark:compileTestJava'.
Compiling with JDK Java compiler API.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
:beam-sdks-java-nexmark:compileTestJava (Thread[Task worker for ':',5,main]) completed. Took 0.275 secs.
:beam-sdks-java-nexmark:processTestResources (Thread[Task worker for ':',5,main]) started.
:beam-sdks-java-nexmark:processTestResources
file or directory ' not found
Skipping task ':beam-sdks-java-nexmark:processTestResources' as it has no source files and no previous output files.
:beam-sdks-java-nexmark:processTestResources NO-SOURCE
:beam-sdks-java-nexmark:processTestResources (Thread[Task worker for ':',5,main]) completed. Took 0.0 secs.
:beam-sdks-java-nexmark:testClasses (Thread[Task worker for ':',5,main]) started.
:beam-sdks-java-nexmark:testClasses
Skipping task ':beam-sdks-java-nexmark:testClasses' as it has no actions.
:beam-sdks-java-nexmark:testClasses (Thread[Task worker for ':',5,main]) completed. Took 0.0 secs.
:beam-sdks-java-nexmark:shadowTestJar (Thread[Task worker for ':',5,main]) started.
:beam-sdks-java-nexmark:shadowTestJar
Appending taskClass to build cache key: com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar_Decorated
Appending classLoaderHash to build cache key: dd005137f3927b1054865fe4d8d513f8
Appending actionType to build cache key: com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar_Decorated
Appending actionClassLoaderHash to build cache key: dd005137f3927b1054865fe4d8d513f8
Appending inputPropertyHash for 'entryCompression' to build cache key: 45df62b6ec2b6322be5846df74267294
Appending inputPropertyHash for 'manifestContentCharset' to build cache key: d746f44d09fb58d2971f341d24d74c35
Appending inputPropertyHash for 'metadataCharset' to build cache key: d746f44d09fb58d2971f341d24d74c35
Appending inputPropertyHash for 'preserveFileTimestamps' to build cache key: f6d7ed39fe24031e22d54f3fe65b901c
Appending inputPropertyHash for 'reproducibleFileOrder' to build cache key: c06857e9ea338f3f3a24bb78f8fbdf6f
Appending inputPropertyHash for 'rootSpec$1.caseSensitive' to build cache key: f6d7ed39fe24031e22d54f3fe65b901c
Appending inputPropertyHash for 'rootSpec$1.destPath' to build cache key: f1d3ff8443297732862df21dc4e57262
Appending inputPropertyHash