[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2018-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338612#comment-16338612
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4401: [BEAM-3060] Support for Perfkit 
execution of file-based-io-tests on HDFS cluster.
URL: https://github.com/apache/beam/pull/4401
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/kubernetes/hadoop/SmallITCluster/pkb-config.yml b/.test-infra/kubernetes/hadoop/SmallITCluster/pkb-config.yml
new file mode 100644
index 000..72f458a9bc8
--- /dev/null
+++ b/.test-infra/kubernetes/hadoop/SmallITCluster/pkb-config.yml
@@ -0,0 +1,40 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This file is a pkb benchmark configuration file, used when running the IO ITs
+# that use this data store. It allows users to run tests when they are on a
+# separate network from the kubernetes cluster by reading the hadoop namenode IP
+# address from the LoadBalancer service.
+#
+# When running Perfkit with DirectRunner - format pattern must additionally contain
+# dfs.client.use.datanode.hostname set to true:
+#   format: '[{\"fs.defaultFS\":\"hdfs://{{LoadBalancerIp}}:9000\",\"dfs.replication\":1,\"dfs.client.use.datanode.hostname\":\"true\" }]'
+# and /etc/hosts should be modified with an entry containing:
+#   LoadBalancerIp HadoopMasterPodName
+# otherwise hdfs client won't be able to reach datanode.
+# FilenamePrefix is used in file-based-io-tests.
+
+static_pipeline_options:
+dynamic_pipeline_options:
+  - name: hdfsConfiguration
+    format: '[{\"fs.defaultFS\":\"hdfs://{{LoadBalancerIp}}:9000\",\"dfs.replication\":1}]'
+    type: LoadBalancerIp
+    serviceName: hadoop-external
+  - name: filenamePrefix
+    format: 'hdfs://{{LoadBalancerIp}}:9000/TEXTIO_IT_'
+    type: LoadBalancerIp
+    serviceName: hadoop-external
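
The `dynamic_pipeline_options` entries in pkb-config.yml above are templates containing a `{{LoadBalancerIp}}` placeholder. The following is a minimal Python sketch of how such a template could be turned into a pipeline option. It is not PerfKit Benchmarker's actual implementation: plain string substitution is an assumption, and the helper name `render_option` and the sample IP address are hypothetical.

```python
import json

PLACEHOLDER = "{{LoadBalancerIp}}"


def render_option(name, fmt, load_balancer_ip):
    """Return a '--name=value' pipeline option with the placeholder filled in.

    Hypothetical helper: plain string substitution is an assumption about
    how the template is resolved, not PerfKit Benchmarker's real code.
    """
    return "--{}={}".format(name, fmt.replace(PLACEHOLDER, load_balancer_ip))


# Templates taken from pkb-config.yml (backslash escaping removed).
hdfs_format = '[{"fs.defaultFS":"hdfs://{{LoadBalancerIp}}:9000","dfs.replication":1}]'
prefix_format = "hdfs://{{LoadBalancerIp}}:9000/TEXTIO_IT_"

ip = "203.0.113.7"  # documentation-range address, used purely as an example
hdfs_option = render_option("hdfsConfiguration", hdfs_format, ip)
prefix_option = render_option("filenamePrefix", prefix_format, ip)

# The rendered hdfsConfiguration value should still parse as JSON.
hdfs_config = json.loads(hdfs_option.split("=", 1)[1])
```

Parsing the rendered value back as JSON is a cheap way to check that the substitution did not break the quoting of the Hadoop configuration.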
diff --git a/sdks/java/io/file-based-io-tests/pom.xml b/sdks/java/io/file-based-io-tests/pom.xml
index bd041040bc4..23c1b31c563 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -133,6 +133,110 @@
 
 
 
+
+org.apache.maven.plugins
+maven-surefire-plugin
+${surefire-plugin.version}
+
+true
+
+
+
+
+
+
+
+
+io-it-hdfs-small
+
+io-it-suite-hdfs-small
+
+
+
+
${project.parent.parent.parent.parent.basedir}
+
+
+
+
+org.codehaus.gmaven
+groovy-maven-plugin
+${groovy-maven-plugin.version}
+
+
+find-supported-python-for-compile
+initialize
+
+execute
+
+
+
${beamRootProjectDir}/sdks/python/findSupportedPython.groovy
+
+
+
+
+
+
+org.codehaus.mojo
+exec-maven-plugin
+${maven-exec-plugin.version}
+
+
+verify
+
+exec
+
+
+
+
+

[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2018-01-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323377#comment-16323377
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4261: [BEAM-3060] HDFS cluster configuration, 
kubernetes scripts, filebased io support …
URL: https://github.com/apache/beam/pull/4261
 
 
   


diff --git a/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster-for-local-dev.yml b/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster-for-local-dev.yml
new file mode 100644
index 000..b761137f35b
--- /dev/null
+++ b/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster-for-local-dev.yml
@@ -0,0 +1,46 @@
+#Licensed to the Apache Software Foundation (ASF) under one or more
+#contributor license agreements.  See the NOTICE file distributed with
+#this work for additional information regarding copyright ownership.
+#The ASF licenses this file to You under the Apache License, Version 2.0
+#(the "License"); you may not use this file except in compliance with
+#the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+#
+# This script creates hadoop-external service that allows to connect to hdfs cluster from
+# outside world. Running:
+#
+#   kubectl get svc hadoop-external
+#
+# allows to read LoadBalancer EXTERNAL-IP which should be used to interact with the hdfs cluster.
+#
+
+apiVersion: v1
+kind: Service
+metadata:
+  name: hadoop-external
+  labels:
+    name: hadoop-external
+spec:
+  ports:
+    - name: sshd
+      port: 2122
+    - name: hdfs
+      port: 9000
+    - name: web
+      port: 50070
+    - name: datanode
+      port: 50010
+    - name: datanode-icp
+      port: 50020
+    - name: datanode-http
+      port: 50075
+  selector:
+    name: hadoop
+  type: LoadBalancer
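
The comment in the file above says that `kubectl get svc hadoop-external` exposes the LoadBalancer EXTERNAL-IP. As a rough Python sketch of reading that address programmatically, the snippet below parses the JSON form of a Service object (as produced by `kubectl get svc hadoop-external -o json`). The helper name `external_ip` and the sample documents are hypothetical, and the sample is hand-written rather than captured from a real cluster.

```python
import json


def external_ip(service_json):
    """Extract the first LoadBalancer ingress IP from a Service object.

    Returns None while no address has been assigned to the service yet.
    """
    service = json.loads(service_json)
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    return ingress[0].get("ip") if ingress else None


# Trimmed, hand-written stand-in for `kubectl get svc hadoop-external -o json`.
sample = json.dumps({
    "kind": "Service",
    "metadata": {"name": "hadoop-external"},
    "status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.7"}]}},
})
# A service whose external address is still pending has no ingress list.
pending = json.dumps({"kind": "Service", "status": {"loadBalancer": {}}})

ip = external_ip(sample)
namenode_uri = "hdfs://{}:9000".format(ip)  # port 9000 is the hdfs port above
```

Returning `None` for a pending service lets a caller poll until the address appears instead of failing on a missing key.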
diff --git a/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster.yml b/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster.yml
new file mode 100644
index 000..483c2961f4f
--- /dev/null
+++ b/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster.yml
@@ -0,0 +1,83 @@
+#Licensed to the Apache Software Foundation (ASF) under one or more
+#contributor license agreements.  See the NOTICE file distributed with
+#this work for additional information regarding copyright ownership.
+#The ASF licenses this file to You under the Apache License, Version 2.0
+#(the "License"); you may not use this file except in compliance with
+#the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+#
+# This script contains definition of hdfs single node cluster. In this configuration hdfs datanode
+# and namenode are running on the same pod. Service hadoop allows to connect to pods labeled as
+# hadoop, this service also provides connectivity from outside of the cluster.
+# Replication controller creates pods using docker image sequenceiq/hadoop-docker:2.7.1.
+# Each pod created will expose hdfs standard ports.
+#
+
+apiVersion: v1
+kind: Service
+metadata:
+  name: hadoop
+  labels:
+    name: hadoop
+spec:
+  ports:
+    - name: sshd
+      port: 2122
+    - name: hdfs
+      port: 9000
+    - name: web
+      port: 50070
+    - name: datanode
+      port: 50010
+    - name: datanode-icp
+      port: 50020
+    - name: datanode-http
+      port: 50075
+  selector:
+    name: hadoop
+  type: NodePort
+
+---
+
+apiVersion: v1
+kind: ReplicationController
+metadata:
+  name: hadoop
+  labels:
+    name: hadoop
+spec:
+  replicas: 1
+  selector:
+    name: hadoop
+  template:
+    metadata:
+      labels:
+        name: hadoop
+    spec:
+      containers:
+        - name: hadoop
+          image: sequenceiq/hadoop-docker:2.7.1
+          ports:
+            - name: sshd
+              containerPort: 2122
+

[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2018-01-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323337#comment-16323337
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4305: [BEAM-3060] Allow to specify timeout for 
FileBasedIOIT ran via PerfKit
URL: https://github.com/apache/beam/pull/4305
 
 
   


diff --git a/sdks/java/io/file-based-io-tests/pom.xml b/sdks/java/io/file-based-io-tests/pom.xml
index 44119ec79ff..4de2e70615f 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -113,6 +113,7 @@
 ${pkbLocation}
 
-benchmarks=beam_integration_benchmark
 -beam_it_profile=io-it
+
-beam_it_timeout=${pkbTimeout}
 
-beam_location=${beamRootProjectDir}
 -beam_prebuilt=true
 -beam_sdk=java
diff --git a/sdks/java/io/pom.xml b/sdks/java/io/pom.xml
index 07e1b5cb9ff..0710df05d89 100644
--- a/sdks/java/io/pom.xml
+++ b/sdks/java/io/pom.xml
@@ -38,6 +38,7 @@
 
 
 
+600
   
 
   


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add performance tests for commonly used file-based I/O PTransforms
> --
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
> Issue Type: Test
> Components: sdk-java-core
> Reporter: Chamikara Jayalath
> Assignee: Szymon Nieradka
>
> We recently added a performance testing framework [1] that can be used to do 
> the following:
> (1) Execute Beam tests using PerfkitBenchmarker.
> (2) Manage Kubernetes-based deployments of data stores.
> (3) Easily publish benchmark results.
> I think it will be useful to add performance tests for commonly used 
> file-based I/O PTransforms using this framework. I suggest looking into the 
> following formats initially:
> (1) AvroIO
> (2) TextIO
> (3) Compressed text using TextIO
> (4) TFRecordIO
> It should be possible to run these tests easily for various Beam runners 
> (Direct, Dataflow, Flink, Spark, etc.) and file systems (GCS, local, HDFS, 
> etc.).
> In the initial version, tests can be made manually triggerable for PRs 
> through Jenkins. Later, we could make some of these tests run periodically 
> and publish benchmark results (to BigQuery) through PerfkitBenchmarker.
> [1] https://beam.apache.org/documentation/io/testing/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2018-01-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16322778#comment-16322778
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4378: [BEAM-3060] split one job into several 
jobs, one for each IO.
URL: https://github.com/apache/beam/pull/4378
 
 
   


diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
index b41af717168..667b11d2072 100644
--- a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -18,60 +18,102 @@
 
 import common_job_properties
 
-// This job runs the file-based IOs performance tests on PerfKit Benchmarker.
-job('beam_PerformanceTests_FileBasedIO_IT') {
-description('Runs PerfKit tests for file-based IOs.')
+def testsConfigurations = [
+[
+jobName   : 'beam_PerformanceTests_TextIOIT',
+jobDescription: 'Runs PerfKit tests for TextIOIT',
+itClass   : 'org.apache.beam.sdk.io.text.TextIOIT',
+bqTable   : 'beam_performance.textioit_pkb_results',
+prCommitStatusName: 'Java TextIO Performance Test',
+prTriggerPhase: 'Run Java TextIO Performance Test',
 
-// Set default Beam job properties.
-common_job_properties.setTopLevelMainJobProperties(delegate)
+],
+[
+jobName: 'beam_PerformanceTests_Compressed_TextIOIT',
+jobDescription : 'Runs PerfKit tests for TextIOIT with GZIP compression',
+itClass: 'org.apache.beam.sdk.io.text.TextIOIT',
+bqTable: 'beam_performance.compressed_textioit_pkb_results',
+prCommitStatusName : 'Java CompressedTextIO Performance Test',
+prTriggerPhase : 'Run Java CompressedTextIO Performance Test',
+extraPipelineArgs: [
+compressionType: 'GZIP'
+]
+],
+[
+jobName   : 'beam_PerformanceTests_AvroIOIT',
+jobDescription: 'Runs PerfKit tests for AvroIOIT',
+itClass   : 'org.apache.beam.sdk.io.avro.AvroIOIT',
+bqTable   : 'beam_performance.avroioit_pkb_results',
+prCommitStatusName: 'Java AvroIO Performance Test',
+prTriggerPhase: 'Run Java AvroIO Performance Test',
+],
+[
+jobName   : 'beam_PerformanceTests_TFRecordIOIT',
+jobDescription: 'Runs PerfKit tests for beam_PerformanceTests_TFRecordIOIT',
+itClass   : 'org.apache.beam.sdk.io.tfrecord.TFRecordIOIT',
+bqTable   : 'beam_performance.tfrecordioit_pkb_results',
+prCommitStatusName: 'Java TFRecordIO Performance Test',
+prTriggerPhase: 'Run Java TFRecordIO Performance Test',
+],
+]
+
+for (testConfiguration in testsConfigurations) {
+create_filebasedio_performance_test_job(testConfiguration)
+}
 
-// Allows triggering this build against pull requests.
-common_job_properties.enablePhraseTriggeringFromPullRequest(
-delegate,
-'Java FileBasedIOs Performance Test',
-'Run Java FileBasedIOs Performance Test')
 
-// Run job in postcommit every 6 hours, don't trigger every push, and
-// don't email individual committers.
-common_job_properties.setPostCommit(
-delegate,
-'0 */6 * * *',
-false,
-'commits@beam.apache.org',
-false)
+private void create_filebasedio_performance_test_job(testConfiguration) {
 
-def pipelineArgs = [
-project: 'apache-beam-testing',
-tempRoot: 'gs://temp-storage-for-perf-tests',
-numberOfRecords: '100',
-filenamePrefix: 'gs://temp-storage-for-perf-tests/filebased/${BUILD_ID}/TESTIOIT',
-]
-def pipelineArgList = []
-pipelineArgs.each({
-key, value -> pipelineArgList.add("\"--$key=$value\"")
-})
-def pipelineArgsJoined = "[" + pipelineArgList.join(',') + "]"
+// This job runs the file-based IOs performance tests on PerfKit Benchmarker.
+job(testConfiguration.jobName) {
+description(testConfiguration.jobDescription)
 
+// Set default Beam job properties.
+common_job_properties.setTopLevelMainJobProperties(delegate)
 
-def itClasses = 
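
The diff above replaces a single Jenkins job with a list of per-IO configurations and reuses the logic that renders pipeline arguments as a bracketed, comma-separated list of quoted `--key=value` strings. The Python sketch below mirrors that rendering for illustration only. The diff is truncated before it shows how `extraPipelineArgs` is consumed, so merging it over the shared arguments is an assumption, and the function name `join_pipeline_args` is hypothetical.

```python
def join_pipeline_args(pipeline_args):
    """Render a dict as '["--key=value","--key=value"]'.

    Mirrors the Groovy pipelineArgs.each { ... } and join(',') logic in the diff.
    """
    quoted = ['"--{}={}"'.format(key, value) for key, value in pipeline_args.items()]
    return "[" + ",".join(quoted) + "]"


# Two of the four configurations from the diff, reduced to the fields used here.
tests_configurations = [
    {"jobName": "beam_PerformanceTests_TextIOIT", "extraPipelineArgs": {}},
    {
        "jobName": "beam_PerformanceTests_Compressed_TextIOIT",
        "extraPipelineArgs": {"compressionType": "GZIP"},
    },
]

base_args = {"project": "apache-beam-testing"}

rendered = {}
for config in tests_configurations:
    # Assumption: per-test arguments are merged over the shared ones.
    merged = dict(base_args, **config["extraPipelineArgs"])
    rendered[config["jobName"]] = join_pipeline_args(merged)
```

Keeping the per-test differences in data rather than in separate job definitions is what lets one loop create all of the jobs.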

[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2018-01-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320055#comment-16320055
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4378: [BEAM-3060] split one job 
into several jobs, one for each IO.
URL: https://github.com/apache/beam/pull/4378
 
 
   Follow this checklist to help us incorporate your contribution quickly and 
easily:
   
- [x] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
- [x] Each commit in the pull request should have a meaningful subject line 
and body.
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
- [x] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
- [x] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   ---
   




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2018-01-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318194#comment-16318194
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4305: [BEAM-3060] Allow to specify 
timeout for FileBasedIOIT ran via PerfKit
URL: https://github.com/apache/beam/pull/4305
 
 
   The timeout defaults to 10 minutes (which is PerfKit's own timeout).
   
   Background: large-scale tests run via PerfKit were failing. This PR allows 
a timeout to be specified so that the tests pass.
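
   As a rough illustration of the behaviour described above, the Python sketch below builds a PerfKit argument list with a configurable timeout. Only the flag names (`-benchmarks`, `-beam_it_profile`, `-beam_it_timeout`) come from the pom.xml diff in this thread. The helper name `pkb_arguments` is hypothetical, and treating the value as seconds is an assumption based on 600 matching the stated 10 minutes.

```python
DEFAULT_TIMEOUT_SECONDS = 600  # 10 minutes, the default named in the PR description


def pkb_arguments(timeout_seconds=None):
    """Build a PerfKit argument list, falling back to the default timeout.

    Hypothetical helper; only the flag names come from the pom.xml diff.
    """
    if timeout_seconds is None:
        timeout_seconds = DEFAULT_TIMEOUT_SECONDS
    if timeout_seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return [
        "-benchmarks=beam_integration_benchmark",
        "-beam_it_profile=io-it",
        "-beam_it_timeout={}".format(timeout_seconds),
    ]


default_args = pkb_arguments()
large_scale_args = pkb_arguments(1200)  # value set by the Jenkins job in PR #4318
```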
   


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2018-01-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318193#comment-16318193
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski closed pull request #4305: [BEAM-3060] Allow to specify 
timeout for FileBasedIOIT ran via PerfKit
URL: https://github.com/apache/beam/pull/4305
 
 
   



[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307368#comment-16307368
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4318: [BEAM-3060] Enable large-scale test for 
FileBasedIOIT
URL: https://github.com/apache/beam/pull/4318
 
 
   


diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
index f24b93238e4..99bab5e10d7 100644
--- a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -43,7 +43,7 @@ job('beam_PerformanceTests_FileBasedIO_IT') {
 def pipelineArgs = [
 project: 'apache-beam-testing',
 tempRoot: 'gs://temp-storage-for-perf-tests',
-numberOfRecords: '100',
+numberOfRecords: '1',
 filenamePrefix: 'gs://temp-storage-for-perf-tests/filebased/${BUILD_ID}/TESTIOIT',
 ]
 def pipelineArgList = []
@@ -62,6 +62,7 @@ job('beam_PerformanceTests_FileBasedIO_IT') {
 itClasses.each {
 def argMap = [
 benchmarks: 'beam_integration_benchmark',
+beam_it_timeout: '1200',
 beam_it_profile: 'io-it',
 beam_prebuilt: 'true',
 beam_sdk: 'java',


 




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301316#comment-16301316
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4318: [BEAM-3060] Enable 
large-scale test for FileBasedIOIT
URL: https://github.com/apache/beam/pull/4318
 
 


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16301314#comment-16301314
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4317: [BEAM-3060] Use dedicated 
BigQuery table for performance tests of FileBasedIOIT
URL: https://github.com/apache/beam/pull/4317
 
 
   Follow this checklist to help us incorporate your contribution quickly and 
easily:
   
- [x] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
- [x] Each commit in the pull request should have a meaningful subject line 
and body.
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
- [x] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
- [x] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   ---
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300248#comment-16300248
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4304: [BEAM-3060] explicitly use Apache's 
Google project for file-based performance tests
URL: https://github.com/apache/beam/pull/4304
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
index fc07e2e11e5..f24b93238e4 100644
--- a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -41,6 +41,7 @@ job('beam_PerformanceTests_FileBasedIO_IT') {
 false)
 
 def pipelineArgs = [
+project: 'apache-beam-testing',
 tempRoot: 'gs://temp-storage-for-perf-tests',
 numberOfRecords: '100',
 filenamePrefix: 'gs://temp-storage-for-perf-tests/filebased/${BUILD_ID}/TESTIOIT',


 




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300226#comment-16300226
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4305: [BEAM-3060] Increase timeout 
for FileBasedIOIT...
URL: https://github.com/apache/beam/pull/4305
 
 
   ...to 20 minutes by default, with an option to override it.
   
   PerfKit has a default timeout of 10 minutes, so large-scale tests run via 
PerfKit are failing. This PR resolves that.
   


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299694#comment-16299694
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4304: [BEAM-3060] explicitly use 
Apache's Google project for file-based performance tests
URL: https://github.com/apache/beam/pull/4304
 
 
   It turns out the project must also be defined in the pipeline options.
   
   I hope this is the last PR in the series configuring this Jenkins job ;) 
   


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298717#comment-16298717
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4296: [BEAM-3060] FIX: remove overriding 
Google project in file-based IOs performance tests
URL: https://github.com/apache/beam/pull/4296
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
index 0cee3d88527..fc07e2e11e5 100644
--- a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -42,7 +42,6 @@ job('beam_PerformanceTests_FileBasedIO_IT') {
 
 def pipelineArgs = [
 tempRoot: 'gs://temp-storage-for-perf-tests',
-project: 'apache-beam-io-testing',
 numberOfRecords: '100',
 filenamePrefix: 'gs://temp-storage-for-perf-tests/filebased/${BUILD_ID}/TESTIOIT',
 ]


 




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298142#comment-16298142
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4296: [BEAM-3060] FIX: remove 
overriding Google project in file-based IOs performance tests
URL: https://github.com/apache/beam/pull/4296
 
 
   In #4267 I accidentally committed my own Google project name, so the tests 
are failing on Jenkins. This change removes it so that the test relies on the 
default one.
   


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297689#comment-16297689
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4267: [BEAM-3060] job for performance tests of 
file-based IOs
URL: https://github.com/apache/beam/pull/4267
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
new file mode 100644
index 000..0cee3d88527
--- /dev/null
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import common_job_properties
+
+// This job runs the file-based IOs performance tests on PerfKit Benchmarker.
+job('beam_PerformanceTests_FileBasedIO_IT') {
+description('Runs PerfKit tests for file-based IOs.')
+
+// Set default Beam job properties.
+common_job_properties.setTopLevelMainJobProperties(delegate)
+
+// Allows triggering this build against pull requests.
+common_job_properties.enablePhraseTriggeringFromPullRequest(
+delegate,
+'Java FileBasedIOs Performance Test',
+'Run Java FileBasedIOs Performance Test')
+
+// Run job in postcommit every 6 hours, don't trigger every push, and
+// don't email individual committers.
+common_job_properties.setPostCommit(
+delegate,
+'0 */6 * * *',
+false,
+'commits@beam.apache.org',
+false)
+
+def pipelineArgs = [
+tempRoot: 'gs://temp-storage-for-perf-tests',
+project: 'apache-beam-io-testing',
+numberOfRecords: '100',
+filenamePrefix: 'gs://temp-storage-for-perf-tests/filebased/${BUILD_ID}/TESTIOIT',
+]
+def pipelineArgList = []
+pipelineArgs.each({
+key, value -> pipelineArgList.add("\"--$key=$value\"")
+})
+def pipelineArgsJoined = "[" + pipelineArgList.join(',') + "]"
+
+
+def itClasses = [
+"org.apache.beam.sdk.io.text.TextIOIT",
+"org.apache.beam.sdk.io.avro.AvroIOIT",
+"org.apache.beam.sdk.io.tfrecord.TFRecordIOIT",
+]
+
+itClasses.each {
+def argMap = [
+benchmarks: 'beam_integration_benchmark',
+beam_it_profile: 'io-it',
+beam_prebuilt: 'true',
+beam_sdk: 'java',
+beam_it_module: 'sdks/java/io/file-based-io-tests',
+beam_it_class: "${it}",
+beam_it_options: pipelineArgsJoined,
+beam_extra_mvn_properties: '["filesystem=gcs"]',
+]
+common_job_properties.buildPerformanceTest(delegate, argMap)
+}
+}
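
For readers less familiar with Groovy, the argument-joining step in the job definition above can be sketched in plain Java. This is only an illustration and not part of the PR: the class and method names (`PipelineArgsSketch`, `joinPipelineArgs`) are hypothetical, and the sketch assumes the map preserves insertion order, as a Groovy map literal does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; these names do not exist in the Beam repository.
public class PipelineArgsSketch {

    // Turns each map entry into a quoted "--key=value" token and joins the
    // tokens with commas inside square brackets, like the Groovy loop does.
    public static String joinPipelineArgs(Map<String, String> pipelineArgs) {
        List<String> tokens = new ArrayList<>();
        for (Map.Entry<String, String> entry : pipelineArgs.entrySet()) {
            tokens.add("\"--" + entry.getKey() + "=" + entry.getValue() + "\"");
        }
        return "[" + String.join(",", tokens) + "]";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output order is stable.
        Map<String, String> pipelineArgs = new LinkedHashMap<>();
        pipelineArgs.put("tempRoot", "gs://temp-storage-for-perf-tests");
        pipelineArgs.put("numberOfRecords", "100");
        System.out.println(joinPipelineArgs(pipelineArgs));
    }
}
```

With the two entries shown, the sketch prints `["--tempRoot=gs://temp-storage-for-perf-tests","--numberOfRecords=100"]`, which is the shape of the value the job passes as `beam_it_options`.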


 



[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294500#comment-16294500
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4260: [BEAM-3060] temporary reshuffle for 
AvroIOIT and TFRecordIOIT
URL: https://github.com/apache/beam/pull/4260
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java
index ce8da3357c9..be0d6df2eb7 100644
--- a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java
@@ -35,6 +35,7 @@
 import org.apache.beam.sdk.transforms.Combine;
 import org.apache.beam.sdk.transforms.DoFn;
 import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.Reshuffle;
 import org.apache.beam.sdk.transforms.Values;
 import org.apache.beam.sdk.transforms.View;
 import org.apache.beam.sdk.values.PCollection;
@@ -102,7 +103,8 @@ public void writeThenReadAll() {
 "Write Avro records to files",
 AvroIO.writeGenericRecords(AVRO_SCHEMA).to(filenamePrefix)
 .withOutputFilenames().withSuffix(".avro"))
-.getPerDestinationOutputFilenames().apply(Values.create());
+.getPerDestinationOutputFilenames().apply(Values.create())
+.apply(Reshuffle.viaRandomKey());
 
 PCollection<String> consolidatedHashcode = testFilenames
 .apply("Read all files", AvroIO.readAllGenericRecords(AVRO_SCHEMA))
diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java
index b887316b187..3f08d76750c 100644
--- a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java
@@ -36,6 +36,7 @@
 import org.apache.beam.sdk.transforms.Create;
 import org.apache.beam.sdk.transforms.MapElements;
 import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.Reshuffle;
 import org.apache.beam.sdk.transforms.SimpleFunction;
 import org.apache.beam.sdk.transforms.View;
 import org.apache.beam.sdk.values.PCollection;
@@ -110,7 +111,8 @@ public void writeThenReadAll() {
 PCollection<String> consolidatedHashcode = readPipeline
 .apply(TFRecordIO.read().from(filenamePattern).withCompression(AUTO))
 .apply("Transform bytes to strings", MapElements.via(new ByteArrayToString()))
-.apply("Calculate hashcode", Combine.globally(new HashingFn()));
+.apply("Calculate hashcode", Combine.globally(new HashingFn()))
+.apply(Reshuffle.viaRandomKey());
 
 String expectedHash = getExpectedHashForLineCount(numberOfTextLines);
 PAssert.thatSingleton(consolidatedHashcode).isEqualTo(expectedHash);


 



[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292714#comment-16292714
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4267: [BEAM-3060] job for 
performance tests of file-based IOs
URL: https://github.com/apache/beam/pull/4267
 
 
   This PR adds a Jenkins job that runs the currently available performance 
tests of file-based IOs on Dataflow via PerfKit.
   


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290896#comment-16290896
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

szewi opened a new pull request #4261: [BEAM-3060] HDFS cluster configuration, 
kubernetes scripts, filebased io support …
URL: https://github.com/apache/beam/pull/4261
 
 
   …for hdfs tests.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add performance tests for commonly used file-based I/O PTransforms
> --
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
>  Issue Type: Test
>  Components: sdk-java-core
>Reporter: Chamikara Jayalath
>Assignee: Szymon Nieradka
>
> We recently added a performance testing framework [1] that can be used to do 
> the following.
> (1) Execute Beam tests using PerfkitBenchmarker
> (2) Manage Kubernetes-based deployments of data stores.
> (3) Easily publish benchmark results. 
> I think it will be useful to add performance tests for commonly used 
> file-based I/O PTransforms using this framework. I suggest looking into 
> the following formats initially.
> (1) AvroIO
> (2) TextIO
> (3) Compressed text using TextIO
> (4) TFRecordIO
> It should be possible to run these tests for various Beam runners (Direct, 
> Dataflow, Flink, Spark, etc.) and file-systems (GCS, local, HDFS, etc.) 
> easily.
> In the initial version, tests can be made manually triggerable for PRs 
> through Jenkins. Later, we could make some of these tests run periodically 
> and publish benchmark results (to BigQuery) through PerfkitBenchmarker.
> [1] https://beam.apache.org/documentation/io/testing/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290857#comment-16290857
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4260: [BEAM-3060] temporary 
reshuffle for AvroIOIT and TFRecordIOIT
URL: https://github.com/apache/beam/pull/4260
 
 
   This is an extension of #4210, where reshuffling was added only to `TextIOIT`.
   
   Follow this checklist to help us incorporate your contribution quickly and 
easily:
   
- [x] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
- [x] Each commit in the pull request should have a meaningful subject line 
and body.
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
- [x] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
- [x] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   ---
   




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286682#comment-16286682
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4238: [BEAM-3060] added support for passing 
extra mvn properties to pkb
URL: https://github.com/apache/beam/pull/4238
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/pom.xml 
b/sdks/java/io/file-based-io-tests/pom.xml
index fc523f614fd..44119ec79ff 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -124,6 +124,11 @@
 
-beam_it_class=${fileBasedIoItClass}
 
 
-beam_it_options=${integrationTestPipelineOptions}
+
+
-beam_extra_mvn_properties=${pkbExtraProperties}
 
 
 
diff --git a/sdks/java/io/pom.xml b/sdks/java/io/pom.xml
index 0f8bc78fbe1..07e1b5cb9ff 100644
--- a/sdks/java/io/pom.xml
+++ b/sdks/java/io/pom.xml
@@ -37,6 +37,7 @@
 
 
 
+
   
 
   


 




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284139#comment-16284139
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4238: [BEAM-3060] added support 
for passing extra mvn properties to pkb
URL: https://github.com/apache/beam/pull/4238
 
 
   Since [this PR in 
PerfKit](https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/1544) 
was merged, it's now possible to pass extra properties to be included in the 
target mvn command when running tests with PerfKit.
   




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283054#comment-16283054
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4210: [BEAM-3060] Temporary fix for failing 
tests on dataflow runner.
URL: https://github.com/apache/beam/pull/4210
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
index e9aac8001b1..5f3f5406d61 100644
--- a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
@@ -46,6 +46,7 @@
 import org.apache.beam.sdk.transforms.Combine;
 import org.apache.beam.sdk.transforms.DoFn;
 import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.Reshuffle;
 import org.apache.beam.sdk.transforms.Values;
 import org.apache.beam.sdk.transforms.View;
 import org.apache.beam.sdk.values.PCollection;
@@ -118,7 +119,8 @@ public void writeThenReadAll() {
         .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
         .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
         .apply("Write content to files", write)
-        .getPerDestinationOutputFilenames().apply(Values.create());
+        .getPerDestinationOutputFilenames().apply(Values.create())
+        .apply(Reshuffle.viaRandomKey());
 
     PCollection consolidatedHashcode = testFilenames
         .apply("Read all files", TextIO.readAll().withCompression(AUTO))


 




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282749#comment-16282749
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

jkff closed pull request #4209: [BEAM-3060] AvroIOIT
URL: https://github.com/apache/beam/pull/4209
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

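diff --git a/sdks/java/io/file-based-io-tests/pom.xml b/sdks/java/io/file-based-io-tests/pom.xml
index 812bfea363a..fc523f614fd 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -196,5 +196,11 @@
             <artifactId>beam-sdks-java-io-common</artifactId>
             <scope>test</scope>
         </dependency>
+        <dependency>
+            <groupId>org.apache.avro</groupId>
+            <artifactId>avro</artifactId>
+            <scope>test</scope>
+        </dependency>
+
     </dependencies>
 </project>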
diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java
new file mode 100644
index 000..ce8da3357c9
--- /dev/null
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.io.avro;
+
+import static org.apache.beam.sdk.io.common.FileBasedIOITHelper.appendTimestampToPrefix;
+import static org.apache.beam.sdk.io.common.FileBasedIOITHelper.getExpectedHashForLineCount;
+import static org.apache.beam.sdk.io.common.FileBasedIOITHelper.readTestPipelineOptions;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.GenericRecordBuilder;
+import org.apache.beam.sdk.coders.AvroCoder;
+import org.apache.beam.sdk.io.AvroIO;
+import org.apache.beam.sdk.io.GenerateSequence;
+import org.apache.beam.sdk.io.common.FileBasedIOITHelper;
+import org.apache.beam.sdk.io.common.HashingFn;
+import org.apache.beam.sdk.io.common.IOTestPipelineOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Combine;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.Values;
+import org.apache.beam.sdk.transforms.View;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * An integration test for {@link AvroIO}.
+ *
+ * Run this test using the command below. Pass in connection information via PipelineOptions:
+ * 
+ *  mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests
+ *  -Dit.test=org.apache.beam.sdk.io.avro.AvroIOIT
+ *  -DintegrationTestPipelineOptions='[
+ *  "--numberOfRecords=10",
+ *  "--filenamePrefix=output_file_path"
+ *  ]'
+ * 
+ * 
+ * Please see 'sdks/java/io/file-based-io-tests/pom.xml' for instructions regarding
+ * running this test using Beam performance testing framework.
+ */
+@RunWith(JUnit4.class)
+public class AvroIOIT {
+
+
+  private static final Schema AVRO_SCHEMA = new Schema.Parser().parse("{\n"
+  + " \"namespace\": \"ioitavro\",\n"
+  + " \"type\": \"record\",\n"
+  + " \"name\": \"TestAvroLine\",\n"
+  + " \"fields\": [\n"
+  + " {\"name\": \"row\", \"type\": \"string\"}\n"
+  + " ]\n"
+  + "}");
+
+  private static String filenamePrefix;
+  private static Long numberOfTextLines;
+
+  @Rule
+  public TestPipeline pipeline = TestPipeline.create();
+
+  @BeforeClass
+  public static void setup() {
+IOTestPipelineOptions options = readTestPipelineOptions();
+
+numberOfTextLines = options.getNumberOfRecords();
+filenamePrefix = appendTimestampToPrefix(options.getFilenamePrefix());
+  }
+
+  @Test
+  public void writeThenReadAll() {
+
+PCollection testFilenames = pipeline
+.apply("Generate sequence", 

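The Javadoc in the diff above passes test settings as a bracketed, quoted list through `-DintegrationTestPipelineOptions`. As a rough illustration of how that value is put together, here is a minimal sketch that assembles it from option name/value pairs. The class `PipelineOptionsArgument` and its method are hypothetical and are not part of Beam; the sketch also assumes option values contain no quotes that would need escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Beam): builds the value passed to
// -DintegrationTestPipelineOptions from option name/value pairs.
final class PipelineOptionsArgument {

  private PipelineOptionsArgument() {
  }

  // Renders each entry as "--name=value" and joins the entries into a
  // bracketed, comma-separated list. Values are assumed to need no escaping.
  static String toArgument(Map<String, String> options) {
    return options.entrySet().stream()
        .map(e -> "\"--" + e.getKey() + "=" + e.getValue() + "\"")
        .collect(Collectors.joining(",", "[", "]"));
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the options in insertion order.
    Map<String, String> options = new LinkedHashMap<>();
    options.put("numberOfRecords", "10");
    options.put("filenamePrefix", "output_file_path");
    System.out.println("-DintegrationTestPipelineOptions='" + toArgument(options) + "'");
  }
}
```

The two option names used in `main` are the ones shown in the Javadoc; any other option would be appended the same way.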
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281008#comment-16281008
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

jkff closed pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java b/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
index 5a29d4f8126..e7b475d4caa 100644
--- a/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
+++ b/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
@@ -19,6 +19,7 @@
 
 import org.apache.beam.sdk.options.Default;
 import org.apache.beam.sdk.options.Description;
+import org.apache.beam.sdk.options.Validation;
 import org.apache.beam.sdk.testing.TestPipelineOptions;
 
 /**
@@ -96,7 +97,7 @@
   void setNumberOfRecords(Long count);
 
   @Description("Destination prefix for files generated by the test")
-  @Default.String("TEXTIOIT")
+  @Validation.Required
   String getFilenamePrefix();
 
   void setFilenamePrefix(String prefix);
diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/FileBasedIOITHelper.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/FileBasedIOITHelper.java
new file mode 100644
index 000..cf20d8e5954
--- /dev/null
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/FileBasedIOITHelper.java
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.options.PipelineOptionsValidator;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Contains helper methods for file based IO Integration tests.
+ */
+public class FileBasedIOITHelper {
+
+  private FileBasedIOITHelper() {
+  }
+
+  public static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    IOTestPipelineOptions options = TestPipeline
+        .testingPipelineOptions()
+        .as(IOTestPipelineOptions.class);
+
+    return PipelineOptionsValidator.validate(IOTestPipelineOptions.class, options);
+  }
+
+  public static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  public static String getExpectedHashForLineCount(Long lineCount) {
+    Map<Long, String> expectedHashes = ImmutableMap.of(
+        100_000L, "4c8bb3b99dcc59459b20fefba400d446",
+        1_000_000L, "9796db06e7a7960f974d5a91164afff1",
+        100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95"
+    );
+
+    String hash = expectedHashes.get(lineCount);
+    if (hash == null) {
+      throw new UnsupportedOperationException(
+          String.format("No hash for that line count: %s", lineCount)
+      );
+    }
+    return hash;
+  }
+
+  /**
+   * Constructs text lines in files used for testing.
+   */
+  public static class DeterministicallyConstructTestTextLineFn extends DoFn<Long, String> {
+
+    @ProcessElement
+    public void processElement(ProcessContext c) {
+      c.output(String.format("IO IT Test line of text. Line seed: %s", c.element()));
+    }
+  }
+
+  /**
+   * Deletes 

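The two static helpers in the diff above, `appendTimestampToPrefix` and `getExpectedHashForLineCount`, can be tried out without a pipeline. The sketch below copies their logic into a standalone class so the behavior is easy to see. The class name `FileBasedIOITHelperSketch` is hypothetical, and `Map.of` stands in for Guava's `ImmutableMap.of`, which assumes Java 9 or later.

```java
import java.util.Date;
import java.util.Map;

// Standalone sketch of two helpers from the diff above. The class name is
// hypothetical; Map.of (Java 9+) replaces Guava's ImmutableMap.of.
final class FileBasedIOITHelperSketch {

  // Expected content hashes, keyed by the number of generated lines
  // (values copied from the diff).
  private static final Map<Long, String> EXPECTED_HASHES = Map.of(
      100_000L, "4c8bb3b99dcc59459b20fefba400d446",
      1_000_000L, "9796db06e7a7960f974d5a91164afff1",
      100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95");

  private FileBasedIOITHelperSketch() {
  }

  // Appends the current time in milliseconds so each run gets its own prefix.
  static String appendTimestampToPrefix(String filenamePrefix) {
    return String.format("%s_%s", filenamePrefix, new Date().getTime());
  }

  // Looks up the expected hash; line counts without a known hash are rejected.
  static String getExpectedHashForLineCount(Long lineCount) {
    String hash = EXPECTED_HASHES.get(lineCount);
    if (hash == null) {
      throw new UnsupportedOperationException(
          String.format("No hash for that line count: %s", lineCount));
    }
    return hash;
  }

  public static void main(String[] args) {
    System.out.println(appendTimestampToPrefix("TEXTIOIT"));
    System.out.println(getExpectedHashForLineCount(100_000L));
  }
}
```

Only the three line counts listed in the map have a hash, so a test run with any other `numberOfRecords` value fails at the lookup rather than at the final assertion.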
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277099#comment-16277099
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

szewi opened a new pull request #4210: [BEAM-3060] Temporary fix for failing 
tests on dataflow runner.
URL: https://github.com/apache/beam/pull/4210
 
 
   The bug is described in 
https://issues.apache.org/jira/projects/BEAM/issues/BEAM-3268
   
   Follow this checklist to help us incorporate your contribution quickly and 
easily:
   
- [x] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
- [x] Each commit in the pull request should have a meaningful subject line 
and body.
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
- [x] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
- [x] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   ---
   




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277065#comment-16277065
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

szewi closed pull request #4210: [BEAM-3060] Temporary fix for failing tests on 
dataflow runner.
URL: https://github.com/apache/beam/pull/4210
 
 
   



 




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277061#comment-16277061
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

szewi opened a new pull request #4210: [BEAM-3060] Temporary fix for failing 
tests on dataflow runner.
URL: https://github.com/apache/beam/pull/4210
 
 
   The bug is described in 
https://issues.apache.org/jira/projects/BEAM/issues/BEAM-3268
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-12-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16277033#comment-16277033
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

DariuszAniszewski opened a new pull request #4209: [BEAM-3060] AvroIOIT
URL: https://github.com/apache/beam/pull/4209
 
 
   Added integration test for AvroIO.
   
   **Note:** This branch is based on structural changes introduced in 
TFRecordIOIT (#4189). 
   


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273695#comment-16273695
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on issue #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#issuecomment-348361581
 
 
   @jkff Thanks again. Posted new changes. PTAL.




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273672#comment-16273672
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on a change in pull request #4189: [BEAM-3060] add 
TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r154234823
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java
 ##
 @@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
+
+  protected static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    return TestPipeline.testingPipelineOptions().as(IOTestPipelineOptions.class);
+  }
+
+  protected static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());
+    } catch (IllegalArgumentException ex) {
+      throw new IllegalArgumentException(
+          String.format("Unsupported compression type: %s", compressionType));
+    }
+  }
+
+  protected String getExpectedHashForLineCount(Long lineCount) {
+    Map<Long, String> expectedHashes = ImmutableMap.of(
+        100_000L, "4c8bb3b99dcc59459b20fefba400d446",
+        1_000_000L, "9796db06e7a7960f974d5a91164afff1",
+        100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95"
+    );
+
+    String hash = expectedHashes.get(lineCount);
+    if (hash == null) {
+      throw new UnsupportedOperationException(
+          String.format("No hash for that line count: %s", lineCount)
+      );
+    }
+    return hash;
+  }
+
+  /**
+   * Constructs text lines in files used for testing.
+   */
+  public static class DeterministicallyConstructTestTextLineFn extends DoFn<Long, String> {
+    @ProcessElement
+    public void processElement(ProcessContext c) {
+      c.output(String.format("IO IT Test line of text. Line seed: %s", c.element()));
+    }
+  }
+
+  /**
+   * Deletes matching files using the FileSystems API.
+   */
+  public static class DeleteFileFn extends DoFn<String, Void> {
+
+    @ProcessElement
+    public void processElement(ProcessContext c) throws IOException {
+      MatchResult match = Iterables
+          .getOnlyElement(FileSystems.match(Collections.singletonList(c.element())));
+
+      Collection<ResourceId> resourceIds = toResourceIds(match);
+
+      FileSystems.delete(resourceIds);
+    }
+    private Collection<ResourceId> toResourceIds(MatchResult match) throws IOException {
 
 Review comment:
   OK, I guess I was too inspired by the way it's done in Java 8+ :)
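
   For readers without the rest of the thread: the reviewer's remark is not shown here, so the following is only a sketch of the two styles that a "Java 8+" comment usually contrasts, a functional-style transform versus a plain loop. It uses a hypothetical `Metadata` stand-in and plain strings, not Beam's `MatchResult.Metadata` or `ResourceId`, and all class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

public class ResourceIdMapping {
  /** Hypothetical stand-in for a matched file's metadata; not a Beam type. */
  public static final class Metadata {
    private final String resourceId;

    public Metadata(String resourceId) {
      this.resourceId = resourceId;
    }

    public String resourceId() {
      return resourceId;
    }
  }

  /** Java 8+ style: map each metadata entry to its resource id with a stream. */
  public static Collection<String> toResourceIdsWithStream(List<Metadata> metadata) {
    return metadata.stream().map(Metadata::resourceId).collect(Collectors.toList());
  }

  /** Plain loop: the same mapping without lambdas or helper libraries. */
  public static Collection<String> toResourceIdsWithLoop(List<Metadata> metadata) {
    List<String> ids = new ArrayList<>();
    for (Metadata m : metadata) {
      ids.add(m.resourceId());
    }
    return ids;
  }
}
```

   Both methods return the ids in input order; the loop form is the one that also compiles on a pre-Java 8 source level.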



[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273675#comment-16273675
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on a change in pull request #4189: [BEAM-3060] add 
TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r154234884
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java
 ##
 @@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
 
 Review comment:
   OK, you're totally right about that. I didn't think it through well enough before.




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273671#comment-16273671
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on a change in pull request #4189: [BEAM-3060] add 
TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r154234810
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.tfrecord;
+
+import static org.apache.beam.sdk.io.Compression.AUTO;
+
+import java.text.ParseException;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.GenerateSequence;
+import org.apache.beam.sdk.io.TFRecordIO;
+import org.apache.beam.sdk.io.common.AbstractFileBasedIOIT;
+import org.apache.beam.sdk.io.common.HashingFn;
+import org.apache.beam.sdk.io.common.IOTestPipelineOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Combine;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.transforms.MapElements;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.SimpleFunction;
+import org.apache.beam.sdk.transforms.View;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * Integration tests for {@link org.apache.beam.sdk.io.TFRecordIO}.
+ *
+ * Run this test using the command below. Pass in connection information via PipelineOptions:
+ * 
+ *  mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests
+ *  -Dit.test=org.apache.beam.sdk.io.tfrecord.TFRecordIOIT
+ *  -DintegrationTestPipelineOptions='[
+ *  "--numberOfRecords=10",
+ *  "--filenamePrefix=FILEBASEDIOIT"
 
 Review comment:
   Actually, because the `to` method resolved the path before submitting the 
pipeline to Google Cloud, a path was created for us with the FILEBASEDIOIT 
name at the end. It looked like `/Users/lukasz/.../FILEBASEDIOIT` and was valid 
even for Google Cloud Dataflow. I guess this would be an issue on machines 
running Windows: the path would be resolved to something like 
`c:\lukasz\...\FILEBASEDIOIT`. This would cause an error on GCP, right? 
   
   Because of the above, I'll make the filenamePrefix a `@Validation.Required` 
option and change the comment to suggest giving a custom path.
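
   To illustrate the concern, here is a small sketch that uses only `java.nio.file`, not Beam's own path resolution. The class name, method names, and the `gs://some-bucket` location are hypothetical. It shows that a bare prefix such as `FILEBASEDIOIT` is resolved against the working directory of the machine that submits the job, while a prefix that already carries a scheme names its filesystem explicitly.

```java
import java.nio.file.Paths;

public class FilenamePrefixCheck {
  /** Returns true when the prefix names its filesystem explicitly, e.g. "gs://bucket/prefix". */
  public static boolean hasScheme(String prefix) {
    return prefix.matches("^[a-zA-Z][a-zA-Z0-9+.-]*://.*");
  }

  /** Shows what a bare prefix turns into when resolved on the local machine. */
  public static String resolveLocally(String prefix) {
    return Paths.get(prefix).toAbsolutePath().toString();
  }

  public static void main(String[] args) {
    // The printed path depends on the OS and working directory of the submitting machine.
    System.out.println(resolveLocally("FILEBASEDIOIT"));
    // A bare prefix carries no scheme; a fully qualified one does.
    System.out.println(hasScheme("FILEBASEDIOIT"));
    System.out.println(hasScheme("gs://some-bucket/FILEBASEDIOIT"));
  }
}
```

   Requiring the option, as proposed above, pushes the caller toward supplying the second kind of prefix instead of relying on whatever the local resolution produces.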



[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273674#comment-16273674
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on a change in pull request #4189: [BEAM-3060] add 
TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r154234867
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java
 ##
 @@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
+
+  protected static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    return TestPipeline.testingPipelineOptions().as(IOTestPipelineOptions.class);
+  }
+
+  protected static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());
 
 Review comment:
   OK. Also, on second thought, I think it's not the FileBasedIOIT class's 
responsibility to check this stuff. 
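
   For reference, the check under discussion follows a common pattern: wrap `Enum.valueOf` so that an unknown name produces a clearer error message. Below is a minimal, self-contained sketch of that pattern. The `Codec` enum is a hypothetical stand-in, not Beam's `Compression`, and the use of `Locale.ROOT` and of the exception cause are additions that are not in the diff above.

```java
import java.util.Locale;

public class CompressionParsing {
  /** Hypothetical stand-in enum; Beam's Compression enum defines its own values. */
  public enum Codec { UNCOMPRESSED, GZIP, BZIP2 }

  /** Parses a case-insensitive codec name, failing with a descriptive message. */
  public static Codec parse(String name) {
    try {
      // Locale.ROOT keeps upper-casing independent of the JVM's default locale.
      return Codec.valueOf(name.toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException ex) {
      throw new IllegalArgumentException(
          String.format("Unsupported compression type: %s", name), ex);
    }
  }
}
```

   Whether such a check belongs in the test base class or closer to the option parsing is the design question raised in the comment.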




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273673#comment-16273673
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on a change in pull request #4189: [BEAM-3060] add 
TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r154234845
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java
 ##
 @@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
+
+  protected static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    return TestPipeline.testingPipelineOptions().as(IOTestPipelineOptions.class);
+  }
+
+  protected static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());
+    } catch (IllegalArgumentException ex) {
+      throw new IllegalArgumentException(
+          String.format("Unsupported compression type: %s", compressionType));
+    }
+  }
+
+  protected String getExpectedHashForLineCount(Long lineCount) {
+    Map<Long, String> expectedHashes = ImmutableMap.of(
+        100_000L, "4c8bb3b99dcc59459b20fefba400d446",
+        1_000_000L, "9796db06e7a7960f974d5a91164afff1",
+        100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95"
+    );
+
+    String hash = expectedHashes.get(lineCount);
+    if (hash == null) {
+      throw new UnsupportedOperationException(
+          String.format("No hash for that line count: %s", lineCount)
+      );
+    }
+    return hash;
+  }
+
+  /**
+   * Constructs text lines in files used for testing.
+   */
+  public static class DeterministicallyConstructTestTextLineFn extends DoFn<Long, String> {
+    @ProcessElement
+    public void processElement(ProcessContext c) {
+      c.output(String.format("IO IT Test line of text. Line seed: %s", c.element()));
+    }
+  }
+
+  /**
+   * Deletes matching files using the FileSystems API.
+   */
+  public static class DeleteFileFn extends DoFn<String, Void> {
 
 Review comment:
   Do you suggest creating a separate JIRA for that?



[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273630#comment-16273630
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj closed pull request #4169: [BEAM-3060] Added support for multiple 
filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/pom.xml 
b/sdks/java/io/file-based-io-tests/pom.xml
index 6c3a7e3718b..812bfea363a 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -139,6 +139,24 @@
 
 
 
+
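+        <profile>
+            <id>google-cloud-storage</id>
+            <activation>
+                <property>
+                    <name>filesystem</name>
+                    <value>gcs</value>
+                </property>
+            </activation>
+            <dependencies>
+                <dependency>
+                    <groupId>org.apache.beam</groupId>
+                    <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
+                    <scope>runtime</scope>
+                </dependency>
+            </dependencies>
+        </profile>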
 
 
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add performance tests for commonly used file-based I/O PTransforms
> --
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
>  Issue Type: Test
>  Components: sdk-java-core
>Reporter: Chamikara Jayalath
>Assignee: Szymon Nieradka
>
> We recently added a performance testing framework [1] that can be used to do 
> following.
> (1) Execute Beam tests using PerfkitBenchmarker
> (2) Manage Kubernetes-based deployments of data stores.
> (3) Easily publish benchmark results. 
> I think it will be useful to add performance tests for commonly used 
> file-based I/O PTransforms using this framework. I suggest looking into 
> the following formats initially.
> (1) AvroIO
> (2) TextIO
> (3) Compressed text using TextIO
> (4) TFRecordIO
> It should be possible to run these tests for various Beam runners (Direct, 
> Dataflow, Flink, Spark, etc.) and file-systems (GCS, local, HDFS, etc.) 
> easily.
> In the initial version, tests can be made manually triggerable for PRs 
> through Jenkins. Later, we could make some of these tests run periodically 
> and publish benchmark results (to BigQuery) through PerfkitBenchmarker.
> [1] https://beam.apache.org/documentation/io/testing/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273472#comment-16273472
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on issue #4169: [BEAM-3060] Added support for multiple 
filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#issuecomment-348325070
 
 
   LGTM




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273473#comment-16273473
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on issue #4169: [BEAM-3060] Added support for multiple 
filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#issuecomment-348325116
 
 
   Run Java PreCommit




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271615#comment-16271615
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

jkff commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r153927154
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java
 ##
 @@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
+
+  protected static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    return TestPipeline.testingPipelineOptions().as(IOTestPipelineOptions.class);
+  }
+
+  protected static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());
 
 Review comment:
   I'm not sure this function provides much benefit. It's not too much to ask 
a user to specify the compression type in uppercase, and the function catches 
an IllegalArgumentException only to throw the same exception type. I suggest 
using valueOf() directly instead of this whole function.
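
   A minimal sketch of the suggestion above, assuming the option value is 
passed in uppercase. The nested `Compression` enum is only a stand-in for 
`org.apache.beam.sdk.io.Compression` so that the example compiles without 
Beam, and `fromOption` is a hypothetical helper name, not part of the PR.

```java
class CompressionOptionExample {

  // Stand-in for org.apache.beam.sdk.io.Compression (assumption: the
  // constants listed here are illustrative, not an authoritative list).
  enum Compression { AUTO, UNCOMPRESSED, GZIP, BZIP2, ZIP, DEFLATE }

  // Hypothetical helper. Enum.valueOf already throws
  // IllegalArgumentException for an unknown constant name, so wrapping it
  // in a try/catch that rethrows the same exception type adds nothing.
  static Compression fromOption(String compressionType) {
    return Compression.valueOf(compressionType);
  }

  public static void main(String[] args) {
    System.out.println(fromOption("GZIP"));
    try {
      fromOption("gzip"); // lowercase does not match any constant name
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

   The trade-off is that lowercase input is rejected rather than normalised, 
which is the behaviour the review comment considers acceptable.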




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271616#comment-16271616
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

jkff commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r153928262
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.tfrecord;
+
+import static org.apache.beam.sdk.io.Compression.AUTO;
+
+import java.text.ParseException;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.GenerateSequence;
+import org.apache.beam.sdk.io.TFRecordIO;
+import org.apache.beam.sdk.io.common.AbstractFileBasedIOIT;
+import org.apache.beam.sdk.io.common.HashingFn;
+import org.apache.beam.sdk.io.common.IOTestPipelineOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Combine;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.transforms.MapElements;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.SimpleFunction;
+import org.apache.beam.sdk.transforms.View;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * Integration tests for {@link org.apache.beam.sdk.io.TFRecordIO}.
+ *
+ * Run this test using the command below. Pass in connection information via PipelineOptions:
+ * 
+ *  mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests
+ *  -Dit.test=org.apache.beam.sdk.io.tfrecord.TFRecordIOIT
+ *  -DintegrationTestPipelineOptions='[
+ *  "--numberOfRecords=10",
+ *  "--filenamePrefix=FILEBASEDIOIT"
 
 Review comment:
   This would not actually be a valid prefix, right? It should be a real path, 
e.g. `gs://some-bucket/output`?




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271613#comment-16271613
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

jkff commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r153927538
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java
 ##
 @@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
+
+  protected static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    return TestPipeline.testingPipelineOptions().as(IOTestPipelineOptions.class);
+  }
+
+  protected static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());
+    } catch (IllegalArgumentException ex) {
+      throw new IllegalArgumentException(
+          String.format("Unsupported compression type: %s", compressionType));
+    }
+  }
+
+  protected String getExpectedHashForLineCount(Long lineCount) {
+    Map<Long, String> expectedHashes = ImmutableMap.of(
+        100_000L, "4c8bb3b99dcc59459b20fefba400d446",
+        1_000_000L, "9796db06e7a7960f974d5a91164afff1",
+        100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95"
+    );
+
+    String hash = expectedHashes.get(lineCount);
+    if (hash == null) {
+      throw new UnsupportedOperationException(
+          String.format("No hash for that line count: %s", lineCount)
+      );
+    }
+    return hash;
+  }
+
+  /**
+   * Constructs text lines in files used for testing.
+   */
+  public static class DeterministicallyConstructTestTextLineFn extends DoFn<Long, String> {
+    @ProcessElement
+    public void processElement(ProcessContext c) {
+      c.output(String.format("IO IT Test line of text. Line seed: %s", c.element()));
+    }
+  }
+
+  /**
+   * Deletes matching files using the FileSystems API.
+   */
+  public static class DeleteFileFn extends DoFn<String, Void> {
 
 Review comment:
   I wonder whether this would make sense in FileIO, e.g. 
PCollection.apply(FileIO.delete()) or something like that. It might be 
outside the scope of this PR, though.



[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271614#comment-16271614
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

jkff commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r153927827
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java
 ##
 @@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
 
 Review comment:
   There are no abstract methods in this class; it's just a collection of 
utility methods. In general, inheritance is harder to deal with than 
composition. I suggest making this class non-abstract with a private 
constructor (so it is non-instantiable), and having callers call its static 
methods directly rather than inheriting from the class.
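
   The refactoring suggested above could look roughly like the sketch below. 
The class names `FileBasedIOITHelper` and `TextIOITExample` are hypothetical 
and chosen only for illustration; `appendTimestampToPrefix` is copied from the 
quoted diff, and the `gs://some-bucket/output` path is a placeholder.

```java
import java.util.Date;

// Hypothetical name: a non-instantiable holder for static test helpers.
final class FileBasedIOITHelper {

  // The private constructor prevents instantiation; together with `final`
  // it also rules out subclassing, so callers must use the static methods.
  private FileBasedIOITHelper() {}

  static String appendTimestampToPrefix(String filenamePrefix) {
    return String.format("%s_%s", filenamePrefix, new Date().getTime());
  }
}

// A test class calls the helper directly instead of extending it.
class TextIOITExample {
  public static void main(String[] args) {
    String prefix = FileBasedIOITHelper.appendTimestampToPrefix("gs://some-bucket/output");
    System.out.println(prefix);
  }
}
```

   With this shape a test class keeps its single superclass slot free and its 
dependency on the helpers is visible at each call site.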




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271617#comment-16271617
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

jkff commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r153927333
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java
 ##
 @@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
+
+  protected static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    return TestPipeline.testingPipelineOptions().as(IOTestPipelineOptions.class);
+  }
+
+  protected static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());
+    } catch (IllegalArgumentException ex) {
+      throw new IllegalArgumentException(
+          String.format("Unsupported compression type: %s", compressionType));
+    }
+  }
+
+  protected String getExpectedHashForLineCount(Long lineCount) {
+    Map<Long, String> expectedHashes = ImmutableMap.of(
+        100_000L, "4c8bb3b99dcc59459b20fefba400d446",
+        1_000_000L, "9796db06e7a7960f974d5a91164afff1",
+        100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95"
+    );
+
+    String hash = expectedHashes.get(lineCount);
+    if (hash == null) {
+      throw new UnsupportedOperationException(
+          String.format("No hash for that line count: %s", lineCount)
+      );
+    }
+    return hash;
+  }
+
+  /**
+   * Constructs text lines in files used for testing.
+   */
+  public static class DeterministicallyConstructTestTextLineFn extends DoFn<Long, String> {
+    @ProcessElement
+    public void processElement(ProcessContext c) {
+      c.output(String.format("IO IT Test line of text. Line seed: %s", c.element()));
+    }
+  }
+
+  /**
+   * Deletes matching files using the FileSystems API.
+   */
+  public static class DeleteFileFn extends DoFn<String, Void> {
+
+    @ProcessElement
+    public void processElement(ProcessContext c) throws IOException {
+      MatchResult match = Iterables
+          .getOnlyElement(FileSystems.match(Collections.singletonList(c.element())));
+
+      Collection<ResourceId> resourceIds = toResourceIds(match);
+
+      FileSystems.delete(resourceIds);
+    }
+    private Collection<ResourceId> toResourceIds(MatchResult match) throws IOException {
 
 Review comment:
   This function probably takes about 2x as much code as a simple loop 
would :) I suggest inlining it and replacing it with a loop.
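
   As a rough illustration of the loop form: the body of `toResourceIds` is 
not shown in the quoted diff, so the sketch below only assumes that it maps 
each matched metadata entry to its resource id. `ResourceId` and `Metadata` 
here are hypothetical stand-ins for the Beam types so that the example is 
self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LoopInsteadOfTransformExample {

  // Hypothetical stand-in for org.apache.beam.sdk.io.fs.ResourceId.
  static final class ResourceId {
    final String path;

    ResourceId(String path) {
      this.path = path;
    }
  }

  // Hypothetical stand-in for one entry of a file match result.
  static final class Metadata {
    private final ResourceId resourceId;

    Metadata(ResourceId resourceId) {
      this.resourceId = resourceId;
    }

    ResourceId resourceId() {
      return resourceId;
    }
  }

  // A plain loop that collects the resource id of every matched entry.
  static List<ResourceId> toResourceIds(List<Metadata> matched) {
    List<ResourceId> resourceIds = new ArrayList<>();
    for (Metadata metadata : matched) {
      resourceIds.add(metadata.resourceId());
    }
    return resourceIds;
  }

  public static void main(String[] args) {
    List<Metadata> matched = Arrays.asList(
        new Metadata(new ResourceId("output-00000-of-00002")),
        new Metadata(new ResourceId("output-00001-of-00002")));
    for (ResourceId id : toResourceIds(matched)) {
      System.out.println(id.path);
    }
  }
}
```

   In the PR itself the loop body would sit directly inside `processElement`, 
which is what "inline it" refers to.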



[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269723#comment-16269723
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on issue #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#issuecomment-347701609
 
 
   R: @jkff can you take this?




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269717#comment-16269717
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy opened a new pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189
 
 
   Follow this checklist to help us incorporate your contribution quickly and 
easily:
   
- [ ] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
- [ ] Each commit in the pull request should have a meaningful subject line 
and body.
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
- [ ] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
- [ ] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   ---
   
   Another test for BEAM-3060. This one uses two pipelines (there seems to 
be no other way yet). I filed a JIRA issue about that: 
https://issues.apache.org/jira/browse/BEAM-3267
   
   @chamikaramj could you take a look?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add performance tests for commonly used file-based I/O PTransforms
> --
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
>  Issue Type: Test
>  Components: sdk-java-core
>Reporter: Chamikara Jayalath
>Assignee: Szymon Nieradka
>
> We recently added a performance testing framework [1] that can be used to do 
> the following.
> (1) Execute Beam tests using PerfkitBenchmarker
> (2) Manage Kubernetes-based deployments of data stores.
> (3) Easily publish benchmark results. 
> I think it will be useful to add performance tests for commonly used 
> file-based I/O PTransforms using this framework. I suggest looking into 
> the following formats initially.
> (1) AvroIO
> (2) TextIO
> (3) Compressed text using TextIO
> (4) TFRecordIO
> It should be possible to run these tests for various Beam runners (Direct, 
> Dataflow, Flink, Spark, etc.) and file-systems (GCS, local, HDFS, etc.) 
> easily.
> In the initial version, tests can be made manually triggerable for PRs 
> through Jenkins. Later, we could make some of these tests run periodically 
> and publish benchmark results (to BigQuery) through PerfkitBenchmarker.
> [1] https://beam.apache.org/documentation/io/testing/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267958#comment-16267958
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added 
support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153376493
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
 ##
 @@ -81,13 +81,38 @@ public static void setup() throws ParseException {
 .as(IOTestPipelineOptions.class);
 
 numberOfTextLines = options.getNumberOfRecords();
-filenamePrefix = appendTimestamp(options.getFilenamePrefix());
+filenamePrefix = resolveProtocolAndPath(options);
   }
 
   private static String appendTimestamp(String filenamePrefix) {
 return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
 
+  private static String resolveProtocolAndPath(IOTestPipelineOptions options) {
 
 Review comment:
   Yeah, we can add a validation step that makes sure that the fileSystem 
property and fileNamePrefix match.
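
   A minimal sketch of what such a validation could look like. The class 
`PrefixValidation`, its method names, and the filesystem-to-scheme mapping 
(`gcs` to `gs`, `hdfs` to `hdfs`, `local` to `file`) are assumptions made for 
illustration; none of them come from the pull request. The sketch also assumes 
the prefix is a syntactically valid URI or a plain Unix-style path.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper, not part of the pull request under review. */
final class PrefixValidation {

  /** Maps a filesystem name to the URI scheme its prefixes are assumed to use. */
  static String expectedScheme(String fileSystem) {
    // Compare case-insensitively so that "GCS" and "gcs" are treated alike.
    switch (fileSystem.toLowerCase(Locale.ROOT)) {
      case "gcs":
        return "gs";
      case "hdfs":
        return "hdfs";
      case "local":
        return "file";
      default:
        throw new IllegalArgumentException("Unknown filesystem: " + fileSystem);
    }
  }

  /** Returns the scheme of the prefix; a prefix without a scheme counts as local. */
  static String schemeOf(String filenamePrefix) {
    String scheme = URI.create(filenamePrefix).getScheme();
    return scheme == null ? "file" : scheme.toLowerCase(Locale.ROOT);
  }

  /** Throws if the filesystem property and the prefix scheme disagree. */
  static void validate(String fileSystem, String filenamePrefix) {
    String expected = expectedScheme(fileSystem);
    String actual = schemeOf(filenamePrefix);
    if (!expected.equals(actual)) {
      throw new IllegalArgumentException(String.format(
          "filesystem '%s' expects scheme '%s' but filenamePrefix '%s' uses '%s'",
          fileSystem, expected, filenamePrefix, actual));
    }
  }
}
```

   With this sketch, `PrefixValidation.validate("GCS", "gs://bucket/path/file")` 
passes, while `PrefixValidation.validate("hdfs", "gs://bucket/path/file")` 
throws an `IllegalArgumentException`.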




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267957#comment-16267957
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added 
support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153376819
 
 

 ##
 File path: sdks/java/io/file-based-io-tests/pom.xml
 ##
 @@ -139,6 +139,24 @@
 
 
 
+
+    <profile>
+      <id>google-cloud-storage</id>
+      <activation>
+        <property>
+          <name>filesystem</name>
+          <value>GCS</value>
 
 Review comment:
   Will one of the solutions mentioned here work - 
https://stackoverflow.com/questions/10521860/property-autocapitalization-in-maven
 ?
   
   




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267956#comment-16267956
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added 
support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153375847
 
 

 ##
 File path: 
sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
 ##
 @@ -100,4 +100,14 @@
   String getFilenamePrefix();
 
   void setFilenamePrefix(String prefix);
+
+  @Description("Google cloud storage - bucket_name/path")
+  String getGcsLocation();
 
 Review comment:
   I think it makes sense to simplify and make 'fileNamePrefix' the full 
prefix. Later if we hit a case where this is inadequate we can revisit.




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267683#comment-16267683
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#issuecomment-347345709
 
 
   Merged. Closing.




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267298#comment-16267298
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

szewi commented on a change in pull request #4169: [BEAM-3060] Added support 
for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153293511
 
 

 ##
 File path: 
sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
 ##
 @@ -100,4 +100,14 @@
   String getFilenamePrefix();
 
   void setFilenamePrefix(String prefix);
+
+  @Description("Google cloud storage - bucket_name/path")
+  String getGcsLocation();
 
 Review comment:
   We can use `--filenamePrefix`, but then we need to provide the full URI, 
including the scheme, for GCS or HDFS, for instance 
`gs://bucket/path/file` or `hdfs://hadoop-master:port/dfs-path/file`. If we 
assume that the user running the tests knows this, then the two options 
gcsLocation and hdfsLocation could be omitted. This is basically an 
implementation of our proposal: 
https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#heading=h.29mfbxd6kc64
Do you think it would be better to remove those two pipeline options and just 
depend on filenamePrefix? Should I also remove the protocol-resolving part then?




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267290#comment-16267290
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

szewi commented on a change in pull request #4169: [BEAM-3060] Added support 
for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153293998
 
 

 ##
 File path: sdks/java/io/file-based-io-tests/pom.xml
 ##
 @@ -139,6 +139,24 @@
 
 
 
+
+    <profile>
+      <id>google-cloud-storage</id>
+      <activation>
+        <property>
+          <name>filesystem</name>
+          <value>GCS</value>
 
 Review comment:
   When `-Dfilesystem=gcs` is provided, it won't activate this profile, because 
the property value is matched case-sensitively. We should decide whether the 
uppercase or the lowercase value of the property is better. 




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265650#comment-16265650
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added 
support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153042053
 
 

 ##
 File path: sdks/java/io/file-based-io-tests/pom.xml
 ##
 @@ -139,6 +139,24 @@
 
 
 
+
+    <profile>
+      <id>google-cloud-storage</id>
+      <activation>
+        <property>
+          <name>filesystem</name>
+          <value>GCS</value>
 
 Review comment:
   Does this require GCS to be all caps ? If so is there a way to not require 
that ?




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265649#comment-16265649
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added 
support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153042101
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
 ##
 @@ -81,13 +81,38 @@ public static void setup() throws ParseException {
 .as(IOTestPipelineOptions.class);
 
 numberOfTextLines = options.getNumberOfRecords();
-filenamePrefix = appendTimestamp(options.getFilenamePrefix());
+filenamePrefix = resolveProtocolAndPath(options);
   }
 
   private static String appendTimestamp(String filenamePrefix) {
 return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
 
+  private static String resolveProtocolAndPath(IOTestPipelineOptions options) {
 
 Review comment:
   I'm not sure why we need to parse and reassemble the protocol here. We 
shouldn't have to do this if we ask the user to give the full prefix, including 
the protocol.
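
   To illustrate the point, here is a small sketch showing that the existing 
timestamp suffix works on a full prefix without any protocol handling. The 
class name `FullPrefixExample` and the two-argument overload are hypothetical 
and exist only to make the example deterministic; the format string is the one 
used by `appendTimestamp` in the diff above.

```java
import java.util.Date;

/** Hypothetical illustration, not code from the pull request. */
final class FullPrefixExample {

  /** Appends the given timestamp to the prefix, leaving any scheme untouched. */
  static String appendTimestamp(String filenamePrefix, long timestampMillis) {
    return String.format("%s_%s", filenamePrefix, timestampMillis);
  }

  /** Same as above, using the current time as the test does. */
  static String appendTimestamp(String filenamePrefix) {
    return appendTimestamp(filenamePrefix, new Date().getTime());
  }
}
```

   Under these assumptions, the same call handles `gs://bucket/path/file`, an 
`hdfs://` prefix, or a local path, so `setup()` would not need a separate 
protocol-resolving step.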




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265648#comment-16265648
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added 
support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153042000
 
 

 ##
 File path: 
sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
 ##
 @@ -100,4 +100,14 @@
   String getFilenamePrefix();
 
   void setFilenamePrefix(String prefix);
+
+  @Description("Google cloud storage - bucket_name/path")
+  String getGcsLocation();
 
 Review comment:
   Why don't we use fileNamePrefix for all file-systems instead of introducing 
a property per file-system ?




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265643#comment-16265643
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#issuecomment-346929621
 
 
   LGTM




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265420#comment-16265420
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#issuecomment-346866400
 
 
   @chamikaramj thanks for the review! Here's another batch of changes, as 
commented above. 




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265393#comment-16265393
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed 
TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152999363
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
 ##
 @@ -83,25 +90,82 @@ private static String appendTimestamp(String 
filenamePrefix) {
 return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
 
-  @Test
-  public void writeThenReadAll() {
-PCollection testFilenames = pipeline
-.apply("Generate sequence", 
GenerateSequence.from(0).to(numberOfTextLines))
-.apply("Produce text lines", ParDo.of(new 
DeterministicallyConstructTestTextLineFn()))
-.apply("Write content to files", 
TextIO.write().to(filenamePrefix).withOutputFilenames())
-.getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)
+  public static class UncompressedTextIOIT {
+
+@Rule
+public TestPipeline pipeline = TestPipeline.create();
+
+@Test
+public void writeThenReadAll() {
+  PCollection testFilenames = pipeline
+  .apply("Generate sequence", 
GenerateSequence.from(0).to(numberOfTextLines))
+  .apply("Produce text lines", ParDo.of(new 
DeterministicallyConstructTestTextLineFn()))
+  .apply("Write content to files", 
TextIO.write().to(filenamePrefix).withOutputFilenames())
+  .getPerDestinationOutputFilenames().apply(Values.create());
+
+  PCollection consolidatedHashcode = testFilenames
+  .apply("Read all files", TextIO.readAll())
+  .apply("Calculate hashcode", Combine.globally(new HashingFn()));
+
+  String expectedHash = getExpectedHashForLineCount(numberOfTextLines);
+  PAssert.thatSingleton(consolidatedHashcode).isEqualTo(expectedHash);
+
+  testFilenames.apply("Delete test files", ParDo.of(new DeleteFileFn())
+  .withSideInputs(consolidatedHashcode.apply(View.asSingleton())));
+
+  pipeline.run().waitUntilFinish();
+}
+  }
+
+  /** IO IT with various compression types. */
+  @RunWith(Parameterized.class)
+  public static class CompressedTextIOIT {
+
+@Rule
+public TestPipeline pipeline = TestPipeline.create();
+
+@Parameterized.Parameters()
+public static Iterable data() {
+  return ImmutableList.builder()
+  .add(GZIP)
+  .add(DEFLATE)
+  .add(BZIP2)
+  .build();
+}
+
+@Parameterized.Parameter()
+public Compression compression;
+
+@Test
+public void writeThenReadAllWithCompression() {
+  TextIO.TypedWrite write = TextIO
+  .write()
+  .to(filenamePrefix)
+  .withOutputFilenames()
+  .withCompression(compression);
+
+  TextIO.ReadAll read = TextIO.readAll().withCompression(AUTO);
 
-PCollection consolidatedHashcode = testFilenames
-.apply("Read all files", TextIO.readAll())
-.apply("Calculate hashcode", Combine.globally(new HashingFn()));
+  PCollection testFilenames = pipeline
 
 Review comment:
   I think it's hard to do right now without modifying perfkit's code. As we 
checked, perfkit ignores -D parameters because it builds the mvn verify command 
itself from the parameters passed to it. I think this could be done in a future 
contribution. We will file a bug report in perfkit soon. 
   
   I think the best solution (at least for now) is to leave the compression 
type in the pipeline options. We pass them to perfkit either way (through 
`beam_it_options`) and, more importantly in my opinion, compressionType is very 
test specific (same as numberOfRecords). WDYT?
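To illustrate the suggestion above, here is a minimal, self-contained sketch of resolving a test-specific compression type from `--name=value` style options. It does not use Beam's PipelineOptions machinery; the class `CompressionOptionSketch`, its enum, and the default of `UNCOMPRESSED` are hypothetical and exist only for illustration. Only the option names `compressionType` and `numberOfRecords` come from the discussion.

```java
import java.util.Locale;

/** Hypothetical sketch: resolving a test-specific "compressionType" option. */
class CompressionOptionSketch {

  /** Stand-in for the compression types under test (not Beam's own enum). */
  enum CompressionType { UNCOMPRESSED, GZIP, DEFLATE, BZIP2 }

  /**
   * Looks for an argument of the form --compressionType=VALUE and returns the
   * matching constant. Falls back to UNCOMPRESSED when the option is absent
   * (an assumed default, chosen only for this sketch).
   */
  static CompressionType parseCompressionType(String[] args) {
    String prefix = "--compressionType=";
    for (String arg : args) {
      if (arg.startsWith(prefix)) {
        String value = arg.substring(prefix.length()).trim().toUpperCase(Locale.ROOT);
        // valueOf throws IllegalArgumentException for an unknown value.
        return CompressionType.valueOf(value);
      }
    }
    return CompressionType.UNCOMPRESSED;
  }

  public static void main(String[] args) {
    String[] options = {"--numberOfRecords=1000", "--compressionType=gzip"};
    System.out.println(parseCompressionType(options));
  }
}
```

Keeping the value next to numberOfRecords means one options string carries everything a given test run needs, whichever tool assembles the final command.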


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add performance tests for commonly used file-based I/O PTransforms
> --
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
> Issue Type: Test
> Components: sdk-java-core
> Reporter: Chamikara Jayalath
> Assignee: Szymon Nieradka
>
> We recently added a performance testing framework [1] that can be used to do 
> the following.
> (1) Execute Beam tests using PerfkitBenchmarker
> (2) Manage Kubernetes-based deployments of data stores.
> (3) Easily publish benchmark results. 
> I think it will be useful to add performance tests for commonly used 
> 

[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265386#comment-16265386
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed 
TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152998058
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
 ##
 @@ -83,25 +90,82 @@ private static String appendTimestamp(String 
filenamePrefix) {
 return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
 
-  @Test
-  public void writeThenReadAll() {
-PCollection testFilenames = pipeline
-.apply("Generate sequence", 
GenerateSequence.from(0).to(numberOfTextLines))
-.apply("Produce text lines", ParDo.of(new 
DeterministicallyConstructTestTextLineFn()))
-.apply("Write content to files", 
TextIO.write().to(filenamePrefix).withOutputFilenames())
-.getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)
+  public static class UncompressedTextIOIT {
 
 Review comment:
   Yes, it works, but it runs all 4 tests in the file. I now think this is 
probably not what we want. It won't be a problem, though, since you suggested 
an even better solution in the comment below. 




> Add performance tests for commonly used file-based I/O PTransforms
> --
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
> Issue Type: Test
> Components: sdk-java-core
> Reporter: Chamikara Jayalath
> Assignee: Szymon Nieradka
>
> We recently added a performance testing framework [1] that can be used to do 
> the following.
> (1) Execute Beam tests using PerfkitBenchmarker
> (2) Manage Kubernetes-based deployments of data stores.
> (3) Easily publish benchmark results. 
> I think it will be useful to add performance tests for commonly used 
> file-based I/O PTransforms using this framework. I suggest looking into 
> the following formats initially.
> (1) AvroIO
> (2) TextIO
> (3) Compressed text using TextIO
> (4) TFRecordIO
> It should be possible to run these tests for various Beam runners (Direct, 
> Dataflow, Flink, Spark, etc.) and file-systems (GCS, local, HDFS, etc.) 
> easily.
> In the initial version, tests can be made manually triggerable for PRs 
> through Jenkins. Later, we could make some of these tests run periodically 
> and publish benchmark results (to BigQuery) through PerfkitBenchmarker.
> [1] https://beam.apache.org/documentation/io/testing/





[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265384#comment-16265384
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed 
TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152998002
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
 ##
 @@ -83,25 +90,82 @@ private static String appendTimestamp(String 
filenamePrefix) {
 return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
 
-  @Test
-  public void writeThenReadAll() {
-PCollection testFilenames = pipeline
-.apply("Generate sequence", 
GenerateSequence.from(0).to(numberOfTextLines))
-.apply("Produce text lines", ParDo.of(new 
DeterministicallyConstructTestTextLineFn()))
-.apply("Write content to files", 
TextIO.write().to(filenamePrefix).withOutputFilenames())
-.getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)
 
 Review comment:
   I double-checked that by running the PreCommit job on my machine: those 
tests are not fired in the PreCommit phase. Also, out of curiosity, I looked 
into the project's mvn structure:
   
   besides the `@RunWith(JUnit4.class)` annotation that is required by JUnit, we 
have two mvn plugins that scan for tests:
 - surefire (looks for unit tests and searches for classes with the *Test 
suffix)
 - failsafe (looks for integration tests and searches for classes with the *IT 
suffix)
   
   As failsafe is not fired in the PreCommit phase, the tests are not invoked. 
Please look at the [io parent 
pom](https://github.com/apache/beam/blob/master/sdks/java/io/pom.xml#L77), 
where the failsafe plugin is activated only when the io-it profile is active. 
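The selection logic described above can be sketched roughly as follows. This is a simplification: it checks only the *Test and *IT suffixes mentioned in the comment, not the plugins' full include patterns, and the class `TestSuffixSketch`, its method names, and the boolean `ioItProfileActive` flag are hypothetical.

```java
/** Hypothetical sketch of suffix-based test selection (simplified). */
class TestSuffixSketch {

  /** Which kind of test a class name suggests. */
  enum Phase { UNIT, INTEGRATION, NONE }

  /** Classifies a simple class name by its suffix only. */
  static Phase classify(String simpleClassName) {
    if (simpleClassName.endsWith("IT")) {
      return Phase.INTEGRATION; // the kind of class failsafe looks for
    }
    if (simpleClassName.endsWith("Test")) {
      return Phase.UNIT; // the kind of class surefire looks for
    }
    return Phase.NONE;
  }

  /**
   * Returns whether the class would run in a build, assuming integration
   * tests run only when the io-it profile is active.
   */
  static boolean runsInBuild(String simpleClassName, boolean ioItProfileActive) {
    switch (classify(simpleClassName)) {
      case UNIT:
        return true;
      case INTEGRATION:
        return ioItProfileActive;
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("TextIOIT without io-it: " + runsInBuild("TextIOIT", false));
    System.out.println("TextIOIT with io-it: " + runsInBuild("TextIOIT", true));
  }
}
```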
   




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264953#comment-16264953
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add 
Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152900718
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
 ##
 @@ -83,25 +90,82 @@ private static String appendTimestamp(String 
filenamePrefix) {
 return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
 
-  @Test
-  public void writeThenReadAll() {
-PCollection testFilenames = pipeline
-.apply("Generate sequence", 
GenerateSequence.from(0).to(numberOfTextLines))
-.apply("Produce text lines", ParDo.of(new 
DeterministicallyConstructTestTextLineFn()))
-.apply("Write content to files", 
TextIO.write().to(filenamePrefix).withOutputFilenames())
-.getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)
+  public static class UncompressedTextIOIT {
+
+@Rule
+public TestPipeline pipeline = TestPipeline.create();
+
+@Test
+public void writeThenReadAll() {
+  PCollection testFilenames = pipeline
+  .apply("Generate sequence", 
GenerateSequence.from(0).to(numberOfTextLines))
+  .apply("Produce text lines", ParDo.of(new 
DeterministicallyConstructTestTextLineFn()))
+  .apply("Write content to files", 
TextIO.write().to(filenamePrefix).withOutputFilenames())
+  .getPerDestinationOutputFilenames().apply(Values.create());
+
+  PCollection consolidatedHashcode = testFilenames
+  .apply("Read all files", TextIO.readAll())
+  .apply("Calculate hashcode", Combine.globally(new HashingFn()));
+
+  String expectedHash = getExpectedHashForLineCount(numberOfTextLines);
+  PAssert.thatSingleton(consolidatedHashcode).isEqualTo(expectedHash);
+
+  testFilenames.apply("Delete test files", ParDo.of(new DeleteFileFn())
+  .withSideInputs(consolidatedHashcode.apply(View.asSingleton())));
+
+  pipeline.run().waitUntilFinish();
+}
+  }
+
+  /** IO IT with various compression types. */
+  @RunWith(Parameterized.class)
+  public static class CompressedTextIOIT {
+
+@Rule
+public TestPipeline pipeline = TestPipeline.create();
+
+@Parameterized.Parameters()
+public static Iterable data() {
+  return ImmutableList.builder()
+  .add(GZIP)
+  .add(DEFLATE)
+  .add(BZIP2)
+  .build();
+}
+
+@Parameterized.Parameter()
+public Compression compression;
+
+@Test
+public void writeThenReadAllWithCompression() {
+  TextIO.TypedWrite write = TextIO
+  .write()
+  .to(filenamePrefix)
+  .withOutputFilenames()
+  .withCompression(compression);
+
+  TextIO.ReadAll read = TextIO.readAll().withCompression(AUTO);
 
-PCollection consolidatedHashcode = testFilenames
-.apply("Read all files", TextIO.readAll())
-.apply("Calculate hashcode", Combine.globally(new HashingFn()));
+  PCollection testFilenames = pipeline
 
 Review comment:
   This and the uncompressed version have the same pipeline. Can't we share the 
code between the tests (keeping the same test class, TextIOIT) and add 
"compression type" as a parameter to the test (a Maven -D parameter for the 
perfkit-based runs)?
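As a rough illustration of sharing one code path across compression settings, the sketch below runs a single write-then-read round trip that takes the codec as a parameter. It uses only `java.util.zip` from the JDK, not Beam's TextIO, so it covers just GZIP and DEFLATE (the JDK ships no BZIP2 codec); the class `SharedRoundTripSketch` and its `Codec` enum are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical sketch: one round trip shared by all compression settings. */
class SharedRoundTripSketch {

  /** Stand-in for the compression parameter (not Beam's Compression enum). */
  enum Codec {
    NONE, GZIP, DEFLATE;

    OutputStream wrap(OutputStream out) throws IOException {
      switch (this) {
        case GZIP: return new GZIPOutputStream(out);
        case DEFLATE: return new DeflaterOutputStream(out);
        default: return out;
      }
    }

    InputStream wrap(InputStream in) throws IOException {
      switch (this) {
        case GZIP: return new GZIPInputStream(in);
        case DEFLATE: return new InflaterInputStream(in);
        default: return in;
      }
    }
  }

  /** Writes the text with the given codec, reads it back, returns the result. */
  static String writeThenRead(String text, Codec codec) {
    try {
      ByteArrayOutputStream stored = new ByteArrayOutputStream();
      // Closing the wrapped stream flushes and finishes the compressed data.
      try (OutputStream out = codec.wrap(stored)) {
        out.write(text.getBytes(StandardCharsets.UTF_8));
      }
      ByteArrayOutputStream restored = new ByteArrayOutputStream();
      try (InputStream in = codec.wrap(new ByteArrayInputStream(stored.toByteArray()))) {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
          restored.write(buffer, 0, n);
        }
      }
      return new String(restored.toByteArray(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String text = "line 1\nline 2\n";
    for (Codec codec : Codec.values()) {
      System.out.println(codec + " round trip ok: " + text.equals(writeThenRead(text, codec)));
    }
  }
}
```

With the codec passed in, the uncompressed case is just one more parameter value rather than a separate test class.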



[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264952#comment-16264952
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add 
Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152900262
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
 ##
 @@ -83,25 +90,82 @@ private static String appendTimestamp(String 
filenamePrefix) {
 return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
 
-  @Test
-  public void writeThenReadAll() {
-PCollection testFilenames = pipeline
-.apply("Generate sequence", 
GenerateSequence.from(0).to(numberOfTextLines))
-.apply("Produce text lines", ParDo.of(new 
DeterministicallyConstructTestTextLineFn()))
-.apply("Write content to files", 
TextIO.write().to(filenamePrefix).withOutputFilenames())
-.getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)
+  public static class UncompressedTextIOIT {
 
 Review comment:
   Does perfkitbenchmarker-based execution 
(https://github.com/apache/beam/pull/4120) still work with these changes?




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264954#comment-16264954
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add 
Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152900435
 
 

 ##
 File path: 
sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
 ##
 @@ -83,25 +90,82 @@ private static String appendTimestamp(String 
filenamePrefix) {
 return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
 
-  @Test
-  public void writeThenReadAll() {
-PCollection testFilenames = pipeline
-.apply("Generate sequence", 
GenerateSequence.from(0).to(numberOfTextLines))
-.apply("Produce text lines", ParDo.of(new 
DeterministicallyConstructTestTextLineFn()))
-.apply("Write content to files", 
TextIO.write().to(filenamePrefix).withOutputFilenames())
-.getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)
 
 Review comment:
   This means that this test will be picked up by all test suites (including 
Java pre-commit), doesn't it? I'm not sure we want that, given the size of 
this test. Adding it to the post-commit tests should be fine.




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264501#comment-16264501
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

lgajowy commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#issuecomment-346653319
 
 
   @chamikaramj (this message is a kind reminder) :)




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264333#comment-16264333
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

szewi commented on issue #4169: [BEAM-3060] Added support for multiple 
filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#issuecomment-346620194
 
 
   Hi @chamikaramj, can you please take a look? This allows switching between 
filesystems by adding the system property -Dfilesystem and providing 
filesystem-specific pipeline options.
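A minimal sketch of how a `filesystem` property could select a filename prefix is shown below. Only the property name comes from the comment above. The accepted values (`local`, `hdfs`, `gcs`), the default of `local`, and the class `FilesystemPropertySketch` are assumptions made for illustration, and the PR itself may do this differently.

```java
import java.util.Locale;
import java.util.Properties;

/** Hypothetical sketch: choosing a filename prefix from a "filesystem" property. */
class FilesystemPropertySketch {

  /**
   * Builds a filename prefix for the given location. The property values and
   * the "local" default are assumptions, not taken from the pull request.
   */
  static String filenamePrefix(Properties props, String location) {
    String filesystem = props.getProperty("filesystem", "local").toLowerCase(Locale.ROOT);
    switch (filesystem) {
      case "hdfs":
        return "hdfs://" + location;
      case "gcs":
        return "gs://" + location;
      case "local":
        return "file://" + location;
      default:
        throw new IllegalArgumentException("Unsupported filesystem: " + filesystem);
    }
  }

  public static void main(String[] args) {
    // Running with -Dfilesystem=hdfs would make System.getProperties() carry the value.
    System.out.println(filenamePrefix(System.getProperties(), "/tmp/textioit"));
  }
}
```

Passing a `Properties` object instead of calling `System.getProperty` directly keeps the lookup easy to exercise without touching global JVM state.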




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264331#comment-16264331
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

szewi opened a new pull request #4169: [BEAM-3060] Added support for multiple 
filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169
 
 
   Follow this checklist to help us incorporate your contribution quickly and 
easily:
   
- [ ] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
- [ ] Each commit in the pull request should have a meaningful subject line 
and body.
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
- [ ] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
- [ ] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   ---
   




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264332#comment-16264332
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

GitHub user szewi opened a pull request:

https://github.com/apache/beam/pull/4169

[BEAM-3060] Added support for multiple filesystems in TextIO

Follow this checklist to help us incorporate your contribution quickly and 
easily:

 - [ ] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
 - [ ] Each commit in the pull request should have a meaningful subject 
line and body.
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
 - [ ] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/szewi/beam filesystems-io-it

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/4169.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4169


commit 8d3d2b5e966ffb820b46007afad0244fd0c384bc
Author: Kamil Szewczyk 
Date:   2017-11-21T19:50:04Z

Added support for multiple filesystems in TextIO






[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260003#comment-16260003
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/4120




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259421#comment-16259421
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

GitHub user lgajowy opened a pull request:

https://github.com/apache/beam/pull/4149

[BEAM-3060] Add Compressed TextIOIT

Follow this checklist to help us incorporate your contribution quickly and 
easily:

 - [ ] Make sure there is a [JIRA 
issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the 
change (usually before you start working on it).  Trivial changes like typos do 
not require a JIRA issue.  Your pull request should address just this issue, 
without pulling in other changes.
 - [ ] Each commit in the pull request should have a meaningful subject 
line and body.
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue.
 - [ ] Write a pull request description that is detailed enough to 
understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more 
thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---

This PR adds a parametrized test for compressed TextIO. It contains only the 
Java code; @DariuszAniszewski is working on Perfkit and Dataflow runner 
support on a separate branch. The ZIP compression type is unsupported, so the 
test skips it.
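
To illustrate the idea of a test parametrized over compression types, here is a minimal sketch. It uses only the Python standard library, not Beam's `TextIO` or the actual `TextIOIT` code. The names `OPENERS`, `content_hash`, `round_trip` and `run_all` are hypothetical, and comparing an order-independent hash of the lines is an assumed verification strategy. ZIP is left out, mirroring the note above.

```python
import bz2
import gzip
import hashlib
import os
import tempfile

# Hypothetical stand-ins for the compression types a parametrized test covers.
# ZIP is intentionally absent, mirroring the PR description.
OPENERS = {
    "uncompressed": open,
    "gzip": gzip.open,
    "bzip2": bz2.open,
}


def content_hash(lines):
    """Return an order-independent digest by hashing the sorted lines."""
    digest = hashlib.sha256()
    for line in sorted(lines):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def round_trip(kind, lines):
    """Write lines with the given compression, read them back, hash them."""
    opener = OPENERS[kind]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data." + kind)
        with opener(path, "wt", encoding="utf-8") as out:
            for line in lines:
                out.write(line + "\n")
        with opener(path, "rt", encoding="utf-8") as src:
            read_back = [line.rstrip("\n") for line in src]
    return content_hash(read_back)


def run_all(number_of_lines=1000):
    """Run the round trip once per compression type and report matches."""
    lines = ["Line {}".format(i) for i in range(number_of_lines)]
    expected = content_hash(lines)
    return {kind: round_trip(kind, lines) == expected for kind in OPENERS}
```

Calling `run_all()` returns a dictionary mapping each compression type to whether the data read back matches the data written. Hashing the sorted lines keeps the check valid even if a reader returns lines in a different order.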

@chamikaramj could you take a look?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lgajowy/beam compressed-text-io-test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/4149.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4149


commit df472abc6ee1b3c2ea021f6069beabd6a4439907
Author: Łukasz Gajowy 
Date:   2017-11-20T16:00:54Z

[BEAM-3060] Add Compressed TextIOIT






[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249368#comment-16249368
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

GitHub user DariuszAniszewski opened a pull request:

https://github.com/apache/beam/pull/4120

[BEAM-3060] TextIOIT: DataFlow and PerfKit profiles + big hash

This PR adds Maven profiles for the Dataflow runner and PerfKit to 
`file-based-io-tests`. It also adds the hash for the large dataset and fixes 
the documentation for `TextIOIT`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/DariuszAniszewski/beam textioit-dataflow-perfkit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/4120.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4120


commit c787e317cf42b21e41cccdf4f2abfeb28f5ab7e3
Author: Dariusz Aniszewski 
Date:   2017-11-07T16:25:55Z

Dataflow and PerfKit profiles; hash for 100.000.000 lines






[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248250#comment-16248250
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/4081




[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240377#comment-16240377
 ] 

ASF GitHub Bot commented on BEAM-3060:
--

GitHub user lgajowy opened a pull request:

https://github.com/apache/beam/pull/4081

[BEAM-3060] Adds TextIOIT for DirectRunner and local filesystem

This is one of several commits to resolve BEAM-3060. Currently only the 
local filesystem, relatively small datasets, and DirectRunner are supported. 
More runners, filesystems, and the ability to test larger datasets 
(gigabytes in size) will be added in later commits.

See: 
https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lgajowy/beam text-io-it

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/4081.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4081


commit c6c7070ad92424707d3720d3a4dc2c0fb6961440
Author: Łukasz Gajowy 
Date:   2017-10-31T09:25:22Z

[BEAM-3060] Adds TextIOIT for DirectRunner and local filesystem

This is one of several commits to resolve BEAM-3060. Currently only the 
local filesystem, relatively small datasets, and DirectRunner are supported. 
More runners, filesystems, and the ability to test larger datasets 
(gigabytes in size) will be added in later commits.

See: 
https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#






[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-10-24 Thread Chamikara Jayalath (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217762#comment-16217762
 ] 

Chamikara Jayalath commented on BEAM-3060:
--

Thanks for the proposal. Added some comments and assigned the JIRA to you.



[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms

2017-10-24 Thread Szymon Nieradka (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217232#comment-16217232
 ] 

Szymon Nieradka commented on BEAM-3060:
---

Please find the proposed implementation description in: 

https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit

> Add performance tests for commonly used file-based I/O PTransforms
> --
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
>  Issue Type: Test
>  Components: sdk-java-core
>Reporter: Chamikara Jayalath
>Assignee: Chamikara Jayalath


