[jira] [Commented] (AVRO-2269) Improve usability of Perf.java

ASF GitHub Bot (JIRA) Mon, 26 Nov 2018 10:28:22 -0800


    [ 
https://issues.apache.org/jira/browse/AVRO-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699400#comment-16699400
 ]


ASF GitHub Bot commented on AVRO-2269:
--------------------------------------

dkulp closed pull request #389: AVRO-2269 Make Perf.java more usable
URL: https://github.com/apache/avro/pull/389
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/build.sh b/build.sh
index 0dc4788e0..10544df05 100755
--- a/build.sh
+++ b/build.sh
@@ -20,9 +20,10 @@ set -e                # exit on error
 cd `dirname "$0"`     # connect to root
 
 VERSION=`cat share/VERSION.txt`
+DOCKER_XTRA_ARGS=""
 
 function usage {
-  echo "Usage: $0 {test|dist|sign|clean|docker|rat|githooks|docker-test}"
+  echo "Usage: $0 {test|dist|sign|clean|docker [--args \"docker-args\"]|rat|githooks|docker-test}"
   exit 1
 }
 
@@ -33,8 +34,10 @@ fi
 
 set -x                # echo commands
 
-for target in "$@"
+while (( "$#" ))
 do
+  target="$1"
+  shift
   case "$target" in
 
     test)
@@ -200,6 +203,10 @@ do
       ;;
 
     docker)
+      if [[ $1 =~ ^--args ]]; then
+        DOCKER_XTRA_ARGS=$2
+        shift 2
+      fi
       docker build -t avro-build -f share/docker/Dockerfile .
       if [ "$(uname -s)" == "Linux" ]; then
         USER_NAME=${SUDO_USER:=$USER}
@@ -226,6 +233,7 @@ UserSpecificDocker
         -v ${HOME}/.m2:/home/${USER_NAME}/.m2 \
         -v ${HOME}/.gnupg:/home/${USER_NAME}/.gnupg \
         -u ${USER_NAME} \
+        ${DOCKER_XTRA_ARGS} \
         avro-build-${USER_NAME} bash
       ;;
 
diff --git a/doc/src/content/htmldocs/performance-testing.html b/doc/src/content/htmldocs/performance-testing.html
new file mode 100644
index 000000000..fcab40dfe
--- /dev/null
+++ b/doc/src/content/htmldocs/performance-testing.html
@@ -0,0 +1,173 @@
+<html>
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+<head>
+<title>Testing performance improvements</title>
+</head>
+
+<body>
+
+(Note: This document pertains only to the Java implementation of Avro.)
+
+
+<h1>1.0 Introduction</h1>
+
+<p>Recent work on improving the performance of "specific record" (<a href="https://issues.apache.org/jira/browse/AVRO-2090">AVRO-2090</a> and <a href="https://issues.apache.org/jira/browse/AVRO-2247">AVRO-2247</a>) has highlighted the need for a benchmark that can be used to test the validity of alleged performance "improvements."</p>
+
+<p> As a starting point, the Avro project has a class called <code>Perf</code> (in the test source of the <code>ipc</code> subproject).  <code>Perf</code> is a command-line tool that contains close to 70 individual performance tests.  These include tests for reading and writing primitive values, arrays and maps, plus tests for reading and writing records through all of the APIs (generic, specific, reflect).</p>
+
+<p> When using <code>Perf</code> for some recent performance work, we encountered two problems.  First, because it depends on build artifacts from across the Avro project, it can be tricky to invoke.  Second, and more seriously, independent runs of the tests in <code>Perf</code> can vary in performance by as much as 40%.  While typical variance is less than that, it is high enough that it is impossible to tell whether a change in performance is simply noise or can be properly attributed to a proposed optimization. </p>
+
+<p> This document addresses both problems: the usability problem in Section 2 and the variability issue in Section 3.  Regarding the variability issue, as you will see, we haven't really been able to manage it in a fundamental manner.  As <a href="https://issues.apache.org/jira/browse/AVRO-2269?focusedCommentId=16688925&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16688925">suggested by Zoltan Farkas</a>, we should look into porting <code>Perf</code> over to using the <a href="https://java-performance.info/jmh/">Java Microbenchmark Harness (JMH)</a>.</p>
+
+
+<h1>2.0 Invoking <code>Perf</code></h1>
+
+<h2>2.1 Simple invocation</h2>
+
+<p>Here is the easiest way we found to directly invoke <code>Perf</code>.</p>
+
+<p>As mentioned in the Introduction, <code>Perf</code> is dependent upon build 
artifacts from some of the other Avro subprojects.  When you invoke 
<code>Perf</code>, it should be invoked with your most recent build of those 
artifacts (assuming you're performance-testing your current work).  We have 
found that the easiest way to ensure the proper artifacts are used is to use 
Maven to invoke <code>Perf</code>. </p>
+
+<p>The recipe for using Maven in this way is simple.  First, from the <code>lang/java</code> directory, you need to build <em>and install</em> Avro:</p>
+
+<p><code>&nbsp;&nbsp;&nbsp;&nbsp;mvn clean install</code></p>
+
+<p>(You can add <code>-DskipTests</code> to the above command line if you don't need to run the test suite.)  When this is done, you need to change your working directory to the <code>lang/java/ipc</code> directory.  From there, you can invoke <code>Perf</code> with the following command line:</p>
+
+<p><code>
+&nbsp;&nbsp;&nbsp;&nbsp;mvn exec:java -Dexec.classpathScope=test 
-Dexec.mainClass=org.apache.avro.io.Perf -Dexec.args="..."
+</code></p>
+
+<p>The <code>exec.args</code> string contains the arguments you want to pass 
through to the <code>Perf.main</code> function.</p>
+
+<p>To speed up your edit-compile-test loop, you can do a selective build of 
Avro in addition to skipping tests:
+
+<p><code>&nbsp;&nbsp;&nbsp;&nbsp;mvn clean && mvn -pl 
"avro,compiler,maven-plugin,ipc" install -DskipTests</code></p>
+
+
+
+<h2>2.2 Using the run-perf.sh script</h2>
+
+<p>If you're using <code>Perf</code>, chances are that you want to compare the 
performance of a proposed optimization against the performance of a baseline 
(that baseline most likely being the current master branch of Avro).  
Generating this comparative data can be tedious if you're running 
<code>Perf</code> by hand.  To relieve this tedium, you can use the 
<code>run-perf.sh</code> script instead (found in the <code>share/test</code> 
directory from the Avro top-level directory).</p>
+
+<p>To use this script, you put different implementations of Avro onto different branches of your Avro git repository.  One of these branches is designated the "baseline" branch and the others are the "treatment" branches.  The script will run the baseline and all the treatments, and will generate a CSV file containing a comparison of the treatments against the baseline.</p>
+
+<p>Running <code>run-perf.sh&nbsp;--help</code> will output a detailed 
manual-page for this script.  Appendix A of this document contains sample 
invocations of this test script for different use cases.</p>
+
+<p>NOTE: as mentioned in <code>run-perf.sh&nbsp;--help</code>, <b>this script 
is designed to be run from the <code>lang/java/ipc</code> directory</b>, which 
is the Maven project containing the <code>Perf</code> program.</p>
+
+
+
+<h1>3.0 Managing variance</h1>
+
+As mentioned in the introduction, we tried a number of different mechanisms to 
reduce variance, including:
+<ul>
+<li> Varying <code>org.apache.avro.io.perf.count</code>, <code>org.apache.avro.io.perf.cycles</code>, and <code>org.apache.avro.io.perf.use-direct</code>, as well as the number of times we run <code>Perf.java</code> within a single "run" of a test.
+
+<p> <li> Taking the minimum times across runs, rather than the maximum times, 
using the second or third run as a baseline rather than the first, using 
statistical methods to eliminate outlying values.
+
+<p> <li> Modifying the code slightly, for example: starting the timer of a cycle after, rather than before, encoders or decoders are constructed; caching encoders and decoders; and reusing record objects during read tests rather than constructing new ones for each record being read.
+
+<p> <li> Using Docker's <code>--cpuset-cpus</code> flag to force the tests 
onto a single core.
+
+<p> <li> Using a dedicated EC2 instance (<code>c5d.2xlarge</code>).
+</ul>
+Of the above, the only change that made a significant difference was the last: in going from a laptop and desktop computer to a dedicated EC2 instance, we went from over 70 tests (out of 200) with a variance of 5% or more between runs to 35.  As mentioned in the introduction, we should switch to a framework like <a href="https://java-performance.info/jmh/">JMH</a> to attack this problem more fundamentally.
+
+<p> If you want to set up your own EC2 instance for testing, here's how we did it.  We launched a dedicated EC2 <code>c5d.2xlarge</code> instance from the AWS console, using the "Amazon Linux 64-bit HVM GP2" AMI.  We logged into this instance and ran the following commands to install Docker and Git (we did all our Avro build and testing inside the Docker image):
+<pre>
+  sudo yum update
+  sudo yum install -y git-all
+  git config --global user.name "Your Name"
+  git config --global user.email [email protected]
+  git config --global core.editor emacs
+  sudo yum install -y docker
+  sudo usermod -aG docker ec2-user ## Need to log back in for this to take effect
+  sudo service docker start
+</pre>
+At this point you can checkout Avro and launch your Docker container:
+<pre>
+  git clone https://github.com/apache/avro.git
+  cd avro
+  screen
+  ./build.sh docker --args "--cpuset-cpus 2,6"
+</pre>
+Note the use of <code>screen</code> here: executions of 
<code>run-perf.sh</code> can take a few hours, depending on the configuration.  
By running it inside of <code>screen</code>, you are protected from an SSH 
disconnection causing <code>run-perf.sh</code> to prematurely terminate.
+
+<p>The <code>--args</code> flag in the last command deserves some explanation.  In general, the <code>--args</code> flag allows you to pass additional arguments to the <code>docker&nbsp;run</code> command executed inside <code>build.sh</code>.  In this case, the <code>--cpuset-cpus</code> flag for <code>docker</code> tells Docker to schedule the container exclusively on the listed (virtual) CPUs.  We identified vCPUs 2 and 6 using the <code>lscpu</code> Linux command:
+<pre>
+  [ec2-user@ip-0-0-0-0 avro]$ lscpu --extended
+  CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
+  0   0    0      0    0:0:0:0       yes
+  1   0    0      1    1:1:1:0       yes
+  2   0    0      2    2:2:2:0       yes
+  3   0    0      3    3:3:3:0       yes
+  4   0    0      0    0:0:0:0       yes
+  5   0    0      1    1:1:1:0       yes
+  6   0    0      2    2:2:2:0       yes
+  7   0    0      3    3:3:3:0       yes
+</pre>
+Notice that (v)CPUs 2 and 6 are both on core 2: it's sufficient to pin the container to a single core, rather than to a single vCPU.  One final tip: to confirm that your container is running on the expected CPUs, run <code>top</code> and then press the <code>1</code> key -- this will show you the load on each individual CPU.
+
+
+<h1>Appendix A: Sample uses of run-perf.sh</h1>
+
+<p>A detailed explanation of <code>run-perf.sh</code> is printed when you give 
it the <code>--help</code> flag.  To help you more quickly understand how to 
use <code>run-perf.sh</code> we present here a few examples of how we used it 
in our recent testing efforts.
+
+<p>  To summarize, you invoke it as follows:
+<pre>
+    ../../../share/test/run-perf.sh [--out-dir D] \
+       [--perf-args STRING] [-Dkey=value]* [--] \
+       [-Dkey=value]* branch_baseline[:name_baseline_run] \
+       [-Dkey=value]* branch_1[:name_treatment_run_1] \
+       ... <br>
+       [-Dkey=value]* branch_n[:name_treatment_run_n] <br>
+</pre>
+The path given here is relative to the <code>lang/java/ipc</code> directory, 
which needs to be the current working directory when calling this script.  The 
script executes multiple <em>runs</em> of testing.  The first run is called the 
<em>baseline run</em>, the subsequent runs are the <em>treatment runs</em>.  
Each run consists of four identical executions of <code>Perf.java</code>.  The 
running times for each <code>Perf.java</code> test are averaged to obtain the 
final running time for the test.  For each treatment run, the final running 
times for each test are compared, as a percentage, to the running time for the 
test in the baseline run.  These percentages are output in the file 
<code>summary.csv</code>.
+
+<p>The following invocation is what we used to measure the variance of 
<code>Perf.java</code>:
+<pre>
+../../../share/test/run-perf.sh --out-dir ~/calibration \
+    -Dorg.apache.avro.specific.use_custom_coders=true \
+    AVRO-2269:baseline AVRO-2269:run1 AVRO-2269:run2 AVRO-2269:run3
+</pre>
+In this invocation, the baseline run and all three treatment runs come from the same Git branch: <code>AVRO-2269</code>.  We need to give a name to each run: in this case runs have been named "baseline"--the baseline run--and "run1", "run2", and "run3"--the treatment runs.  Note that the name of the Git branch to be used for a run must always be provided, but the name for the run itself (e.g., "baseline") is optional.  If a name for the run is not provided, then the name of the Git branch will be used as the name of the run.  However, each run must have a unique name, so in this example we had to explicitly name the runs since all of them are on the same branch.
+
+<p><code>run-perf.sh</code> uses Maven to invoke <code>Perf.java</code>.  The <code>-D</code> flag is used to pass system properties to Maven, which in turn will pass them through to <code>Perf.java</code>.  In the example above, we use this flag to turn on the custom-coders feature recently checked into Avro.  Note that initial <code>-D</code> flags will be passed to <em>all</em> runs, while <code>-D</code> flags that come just before the name of the Git branch of a run apply to only that run.  In the case of the baseline run, which comes first, if you want to pass <code>-D</code> flags to just that run, then use the <code>--</code> flag to indicate that all global parameters for <code>run-perf.sh</code> have been provided, followed by the <code>-D</code> flags you want to pass to only the baseline run.
+
+<p>Finally, note that <code>run-perf.sh</code> generates a lot of intermediate files as well as the final <code>summary.csv</code> file.  Thus, it is recommended that the output of each execution of <code>run-perf.sh</code> be sent to a dedicated directory, provided by the <code>--out-dir</code> flag.  If that directory does not exist, it will be created.  (Observe that <code>run-perf.sh</code> outputs a file called <code>command.txt</code> containing the full command line used to invoke it.  This can be helpful if you run a lot of experiments and forget the detailed setup of some of them along the way.)
+
+<p>The next invocation is what we used to ensure that the new "custom coders" 
optimization for specific records does indeed improve performance:
+<pre>
+../../../share/test/run-perf.sh --out-dir ~/retest-codegen \
+    --perf-args "-Sf" \
+    AVRO-2269:baseline \
+    -Dorg.apache.avro.specific.use_custom_coders=true AVRO-2269:custom-coders
+</pre>
+In this case, unlike the previous one, the <code>-D</code> flag that turns on 
the use of custom coders is applied specifically to the treatment run, and not 
globally.  Also, since this flag only affects the Specific Record case, we use 
the <code>--perf-args</code> flag to pass additional arguments to 
<code>Perf.java</code>; in this case, the <code>-Sf</code> flag tells 
<code>Perf.java</code> to run just the specific-record related tests and not 
the entire test suite.
+
+<p>This last example shows how we checked the performance impact of two new 
feature-branches we've been developing:
+<pre>
+../../../share/test/run-perf.sh --out-dir ~/new-branches \
+    -Dorg.apache.avro.specific.use_custom_coders=true \
+    AVRO-2269:baseline combined-opts full-refactor
+</pre>
+Once again, we turn on custom-coders for all runs, and the Git branch <code>AVRO-2269</code> is used for our baseline run.  This time, however, the treatment runs come from two other Git branches: <code>combined-opts</code> and <code>full-refactor</code>.  We didn't provide run names for these runs because the Git branch names were fine to use as run names (we explicitly named the first run "baseline" not because we had to, but because we like the convention of using that name).
+
+<p>Although we didn't state it before, in preparing for a run, 
<code>run-perf.sh</code> will checkout the Git branch to be used for the run 
and use <code>mvn&nbsp;install</code> to build and install it.  It does this 
for each branch, so the invocation just given will checkout and build three 
different branches during its overall execution.  (As an optimization, if one 
run uses the same branch as the previous run, then the branch is <em>not</em> 
checked-out or rebuilt between runs.)
+
+</body>
+</html>
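
Editor's illustration (not part of the patch): Appendix A of the document above says that each run consists of four executions of Perf.java, that the running times are averaged, and that each treatment is then compared, as a percentage, to the baseline. The Java sketch below walks through that arithmetic. It is only a sketch under stated assumptions: `RunComparison` and its methods are hypothetical names, the timings are invented, and the exact percentage formula used by run-perf.sh is not shown in this excerpt, so "treatment divided by baseline" is an assumption.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Avro or run-perf.sh. */
final class RunComparison {

  /** Averages the running times of one test across the executions of a run. */
  static double average(long[] timesMs) {
    if (timesMs.length == 0) {
      throw new IllegalArgumentException("a run needs at least one execution");
    }
    return Arrays.stream(timesMs).average().getAsDouble();
  }

  /**
   * Expresses the averaged treatment time as a percentage of the averaged
   * baseline time. ASSUMPTION: the script may define its percentage differently.
   */
  static double percentOfBaseline(long[] baselineMs, long[] treatmentMs) {
    return 100.0 * average(treatmentMs) / average(baselineMs);
  }

  public static void main(String[] args) {
    // Invented timings for one test: four executions per run.
    long[] baseline = {1000, 1020, 980, 1000};
    long[] treatment = {900, 910, 890, 900};
    double pct = percentOfBaseline(baseline, treatment);
    System.out.println("treatment as percent of baseline: " + pct);
  }
}
```

With these invented numbers the treatment averages 900 ms against a 1000 ms baseline, so a value below 100 would indicate a faster treatment under the assumed formula.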
diff --git a/lang/java/ipc/src/test/java/org/apache/avro/io/Perf.java b/lang/java/ipc/src/test/java/org/apache/avro/io/Perf.java
index 860669468..df4be24f8 100644
--- a/lang/java/ipc/src/test/java/org/apache/avro/io/Perf.java
+++ b/lang/java/ipc/src/test/java/org/apache/avro/io/Perf.java
@@ -102,6 +102,8 @@ void add(List<TestDescriptor> typeList) {
     new TestDescriptor(StringTest.class, "-s").add(BASIC);
     new TestDescriptor(ArrayTest.class, "-a").add(BASIC);
     new TestDescriptor(MapTest.class, "-m").add(BASIC);
+    new TestDescriptor(ExtendedEnumResolveTest.class, "-ee").add(BASIC);
+    new TestDescriptor(UnchangedUnionResolveTest.class, "-uu").add(BASIC);
     BATCHES.put("-record", RECORD);
     new TestDescriptor(RecordTest.class, "-R").add(RECORD);
     new TestDescriptor(ValidatingRecord.class, "-Rv").add(RECORD);
@@ -141,13 +143,14 @@ void add(List<TestDescriptor> typeList) {
   private static final int BYTES_PS_FIELD = 2;
   private static final int ENTRIES_PS_FIELD = 3;
   private static final int BYTES_PC_FIELD = 4;
-  private static final int MAX_FIELD = 4;
+  private static final int MIN_TIME_FIELD = 5;
+  private static final int MAX_FIELD_TAG = 5;
 
   private static void usage() {
-    StringBuilder usage = new StringBuilder("Usage: Perf [-o <file>] [-c <spec>] { -nowrite | -noread | ");
+    StringBuilder usage = new StringBuilder("Usage: Perf [-o <file>] [-c <spec>] { -nowrite | -noread }");
     StringBuilder details = new StringBuilder();
     details.append(" -o file   (send output to a file)\n");
-    details.append(" -c [n][t][e][b][c] (format as no-header CSV; include Name, Time, Entries/sec, Bytes/sec, and/or bytes/Cycle; no spec=all fields)\n");
+    details.append(" -c [n][t][e][b][c][m] (format as no-header CSV; include Name, Time, Entries/sec, Bytes/sec, bytes/Cycle, and/or min time/op; no spec=all fields)\n");
     details.append(" -nowrite   (do not execute write tests)\n");
     details.append(" -noread   (do not execute write tests)\n");
     for (Map.Entry<String, List<TestDescriptor>> entry : BATCHES.entrySet()) {
@@ -179,6 +182,7 @@ public static void main(String[] args) throws Exception {
     String outputfilename = null;
     PrintStream out = System.out;
     boolean[] csvFormat = null;
+    String csvFormatString = null;
 
     for (int i = 0; i < args.length; i++) {
       String a = args[i];
@@ -200,17 +204,20 @@ public static void main(String[] args) throws Exception {
         continue;
       }
       if ("-c".equals(a)) {
-        if (i == args.length-1 || args[i+1].startsWith("-"))
-          csvFormat = new boolean[] { true, true, true, true, true };
-        else {
-          csvFormat = new boolean[5];
-          for (char c : args[++i].toCharArray())
+        if (i == args.length-1 || args[i+1].startsWith("-")) {
+          csvFormatString = "ntebcm"; // For diagnostics
+          csvFormat = new boolean[] { true, true, true, true, true, true };
+        } else {
+          csvFormatString = args[++i];
+          csvFormat = new boolean[MAX_FIELD_TAG+1];
+          for (char c : csvFormatString.toCharArray())
             switch (c) {
             case 'n': csvFormat[NAME_FIELD] = true; break;
             case 't': csvFormat[TIME_FIELD] = true; break;
             case 'e': csvFormat[BYTES_PS_FIELD] = true; break;
             case 'b': csvFormat[ENTRIES_PS_FIELD] = true; break;
             case 'c': csvFormat[BYTES_PC_FIELD] = true; break;
+            case 'm': csvFormat[MIN_TIME_FIELD] = true; break;
             default:
               usage();
               System.exit(1);
@@ -237,9 +244,12 @@ public static void main(String[] args) throws Exception {
       }
     }
     System.out.println("Executing tests: \n" + tests +  "\n readTests:" +
-        readTests + "\n writeTests:" + writeTests + "\n cycles=" + CYCLES);
+        readTests + "\n writeTests:" + writeTests + "\n cycles=" + CYCLES +
+        "\n count=" + (COUNT / 1000) + "K");
     if (out != System.out) System.out.println(" Writing to: " + outputfilename);
-    if (csvFormat != null) System.out.println(" in CSV format.");
+    if (csvFormat != null) System.out.println(" CSV format: " + csvFormatString);
+
+    TestResult tr = new TestResult();
 
     for (int k = 0; k < tests.size(); k++) {
       Test t = tests.get(k);
@@ -275,28 +285,41 @@ public static void main(String[] args) throws Exception {
           t.writeTest();
         }
       }
-      t.reset();
+
       // test
-      long s = 0;
       System.gc();
-      t.init();
       if (t.isReadTest() && readTests) {
+        tr.reset();
         for (int i = 0; i < t.cycles; i++) {
-          s += t.readTest();
+          tr.update(t.readTest());
         }
-        printResult(out, csvFormat, s, t, t.name + "Read");
+        printResult(out, csvFormat, tr, t, t.name + "Read");
       }
-      s = 0;
       if (t.isWriteTest() && writeTests) {
+        tr.reset();
         for (int i = 0; i < t.cycles; i++) {
-          s += t.writeTest();
+          tr.update(t.writeTest());
         }
-        printResult(out, csvFormat, s, t, t.name + "Write");
+        printResult(out, csvFormat, tr, t, t.name + "Write");
       }
       t.reset();
     }
   }
 
+  private static class TestResult {
+    public long totalTime;
+    public long minTime;
+    public void reset() {
+      totalTime = 0L;
+      minTime = Long.MAX_VALUE;
+    }
+    public long update(long t) {
+      totalTime += t;
+      minTime = Math.min(t, minTime);
+      return t;
+    }
+  }
+
   private static final void printHeader() {
     String header = String.format(
         "%60s     time    M entries/sec   M bytes/sec  bytes/cycle",
@@ -305,23 +328,25 @@ private static final void printHeader() {
   }
 
   private static final void printResult(PrintStream o, boolean[] csv,
-                                        long s, Test t, String name)
+                                        TestResult tr, Test t, String name)
   {
-    s /= 1000;
+    long s = tr.totalTime / 1000;
     double entries = (t.cycles * (double) t.count);
     double bytes = t.cycles * (double) t.encodedSize;
     StringBuilder result = new StringBuilder();
     if (csv != null) {
       boolean commaneeded = false;
-      for (int i = 0; i <= MAX_FIELD; i++) {
+      for (int i = 0; i <= MAX_FIELD_TAG; i++) {
+        if (! csv[i]) continue;
         if (commaneeded) result.append(",");
         else commaneeded = true;
         switch (i) {
         case NAME_FIELD: result.append(name); break;
         case TIME_FIELD: result.append(String.format("%d", (s/1000))); break;
         case BYTES_PS_FIELD: result.append(String.format("%.3f", (entries / s))); break;
-        case ENTRIES_PS_FIELD: result.append(String.format(".3%f", (bytes / s))); break;
+        case ENTRIES_PS_FIELD: result.append(String.format("%.3f", (bytes / s))); break;
         case BYTES_PC_FIELD: result.append(String.format("%d", t.encodedSize)); break;
+        case MIN_TIME_FIELD: result.append(String.format("%d", tr.minTime)); break;
         }
       }
     } else {
@@ -388,6 +413,13 @@ public String toString() {
    * higher level constructs, just manual serialization.
    */
   private static abstract class BasicTest extends Test {
+    /** Switch to using a DirectBinaryEncoder rather than a BufferedBinaryEncoder
+     *  for writing tests.  DirectBinaryEncoders are noticeably slower than Buffered
+     *  ones, but they can be more consistent in their performance, which can make
+     *  it easier to detect small performance improvements. */
+    public static boolean USE_DIRECT_ENCODER
+      = Boolean.parseBoolean(System.getProperty("org.apache.avro.io.perf.use-direct","false"));
+
     protected final Schema schema;
     protected byte[] data;
     BasicTest(String name, String json) throws IOException {
@@ -400,16 +432,16 @@ public String toString() {
 
     @Override
     public final long readTest() throws IOException {
-      long t = System.nanoTime();
       Decoder d = getDecoder();
+      long t = System.nanoTime();
       readInternal(d);
       return (System.nanoTime() - t);
     }
 
     @Override
     public final long writeTest() throws IOException {
-      long t = System.nanoTime();
       Encoder e = getEncoder();
+      long t = System.nanoTime();
       writeInternal(e);
       e.flush();
       return (System.nanoTime() - t);
@@ -428,8 +460,8 @@ protected Decoder newDecoder() {
     }
 
     protected Encoder newEncoder(ByteArrayOutputStream out) throws IOException {
-      Encoder e = encoder_factory.binaryEncoder(out, null);
-//    Encoder e = encoder_factory.directBinaryEncoder(out, null);
+      Encoder e = (USE_DIRECT_ENCODER ? encoder_factory.directBinaryEncoder(out, null)
+                                      : encoder_factory.binaryEncoder(out, null));
 //    Encoder e = encoder_factory.blockingBinaryEncoder(out, null);
 //    Encoder e = new LegacyBinaryEncoder(out);
       return e;
@@ -1419,18 +1451,13 @@ protected Decoder getDecoder() {
     protected final SpecificDatumReader<T> reader;
     protected final SpecificDatumWriter<T> writer;
     private Object[] sourceData;
+    private T reuse;
 
     protected SpecificTest(String name, String writerSchema) throws IOException {
       super(name, writerSchema, 48);
       reader = newReader();
       writer = newWriter();
     }
-    protected SpecificDatumReader<T> getReader() {
-      return reader;
-    }
-    protected SpecificDatumWriter<T> getWriter() {
-      return writer;
-    }
     protected SpecificDatumReader<T> newReader() {
       return new SpecificDatumReader<>(schema);
     }
@@ -1444,6 +1471,7 @@ void genSourceData() {
       for (int i = 0; i < sourceData.length; i++) {
         sourceData[i] = genSingleRecord(r);
       }
+      reuse = genSingleRecord(r);
     }
 
     protected abstract T genSingleRecord(Random r);
@@ -1451,7 +1479,7 @@ void genSourceData() {
     @Override
     void readInternal(Decoder d) throws IOException {
       for (int i = 0; i < count; i++) {
-        getReader().read(null, d);
+        reader.read(reuse, d);
       }
     }
     @Override
@@ -1459,7 +1487,7 @@ void writeInternal(Encoder e) throws IOException {
       for (int i = 0; i < sourceData.length; i++) {
         @SuppressWarnings("unchecked")
         T rec = (T) sourceData[i];
-        getWriter().write(rec, e);
+        writer.write(rec, e);
       }
     }
     @Override
@@ -1815,4 +1843,95 @@ protected Rec1 createDatum(Random r) {
       return new Rec1(r);
     }
   }
+
+  static abstract class ResolvingTest extends BasicTest {
+    GenericRecord[] sourceData = null;
+    Schema writeSchema;
+
+    private static String mkSchema(String subschema) {
+      return ("{ \"type\": \"record\", \"name\": \"R\", \"fields\": [\n"
+              + "{ \"name\": \"f\", \"type\": " + subschema + "}\n"
+              + "] }");
+    }
+
+    public ResolvingTest(String name, String r, String w) throws IOException {
+      super(name, mkSchema(r));
+      isWriteTest = false;
+      this.writeSchema = new Schema.Parser().parse(mkSchema(w));
+    }
+
+    @Override
+    protected Decoder getDecoder() throws IOException {
+      return new ResolvingDecoder(writeSchema, schema, super.getDecoder());
+    }
+
+    @Override
+    void readInternal(Decoder d) throws IOException {
+      GenericDatumReader<Object> reader = new GenericDatumReader<>(schema);
+      for (int i = 0; i < count; i++) {
+        reader.read(null, d);
+      }
+    }
+
+    @Override
+    void writeInternal(Encoder e) throws IOException {
+      GenericDatumWriter<Object> writer = new GenericDatumWriter<>(writeSchema);
+      for (int i = 0; i < sourceData.length; i++) {
+        writer.write(sourceData[i], e);
+      }
+    }
+
+    @Override
+    void reset() {
+      sourceData = null;
+      data = null;
+    }
+  }
+
+  static class ExtendedEnumResolveTest extends ResolvingTest {
+    private static final String ENUM_WRITER =
+      "{ \"type\": \"enum\", \"name\":\"E\", \"symbols\": [\"A\", \"B\"] }";
+    private static final String ENUM_READER =
+      "{ \"type\": \"enum\", \"name\":\"E\", \"symbols\": [\"A\",\"B\",\"C\",\"D\",\"E\"] }";
+
+    public ExtendedEnumResolveTest() throws IOException {
+      super("ExtendedEnum", ENUM_READER, ENUM_WRITER);
+    }
+
+    @Override
+    void genSourceData() {
+      Random r = newRandom();
+      Schema eSchema = writeSchema.getField("f").schema();
+      sourceData = new GenericRecord[count];
+      for (int i = 0; i < sourceData.length; i++) {
+        GenericRecord rec = new GenericData.Record(writeSchema);
+        int tag = r.nextInt(2);
+        rec.put("f", GenericData.get().createEnum(eSchema.getEnumSymbols().get(tag), eSchema));
+        sourceData[i] = rec;
+      }
+    }
+  }
+
+  static class UnchangedUnionResolveTest extends ResolvingTest {
+    private static final String UNCHANGED_UNION =
+      "[ \"null\", \"int\" ]";
+
+    public UnchangedUnionResolveTest() throws IOException {
+      super("UnchangedUnion", UNCHANGED_UNION, UNCHANGED_UNION);
+    }
+
+    @Override
+    void genSourceData() {
+      Random r = newRandom();
+      Schema uSchema = writeSchema.getField("f").schema();
+      sourceData = new GenericRecord[count];
+      for (int i = 0; i < sourceData.length; i++) {
+        GenericRecord rec = new GenericData.Record(writeSchema);
+        int val = r.nextInt(1000000);
+        Integer v = (val < 750000 ? Integer.valueOf(val) : null);
+        rec.put("f", v);
+        sourceData[i] = rec;
+      }
+    }
+  }
 }
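
Editor's illustration (not part of the patch): the Perf.java changes above do two things to the timing logic. They start the clock only after the encoder or decoder has been obtained, and they accumulate both the total and the minimum cycle time. The Java sketch below shows that pattern in isolation. `CycleTimer` and `timeCycle` are hypothetical names chosen for this example, and the workload in `main` is a stand-in loop rather than an Avro read or write.

```java
/** Hypothetical stand-alone version of the total/minimum accumulation pattern. */
final class CycleTimer {
  private long totalNanos = 0L;
  private long minNanos = Long.MAX_VALUE;

  /** Clears the accumulated values before a new batch of cycles. */
  void reset() {
    totalNanos = 0L;
    minNanos = Long.MAX_VALUE;
  }

  /** Records the elapsed time of one cycle. */
  void update(long nanos) {
    totalNanos += nanos;
    minNanos = Math.min(minNanos, nanos);
  }

  long total() { return totalNanos; }

  long min() { return minNanos; }

  /** Runs setup outside the timed region, then times only the work itself. */
  static long timeCycle(Runnable setup, Runnable work) {
    setup.run();                      // untimed, like constructing an encoder
    long start = System.nanoTime();
    work.run();                       // timed, like the inner read or write loop
    return System.nanoTime() - start;
  }

  public static void main(String[] args) {
    CycleTimer timer = new CycleTimer();
    int[] sink = new int[1];          // keeps the stand-in workload observable
    for (int cycle = 0; cycle < 5; cycle++) {
      timer.update(timeCycle(
          () -> sink[0] = 0,
          () -> { for (int i = 0; i < 1_000_000; i++) sink[0] += i; }));
    }
    System.out.println("total ns=" + timer.total() + ", min ns=" + timer.min());
  }
}
```

Reporting the minimum alongside the total is one way to dampen the effect of occasional slow cycles; as the document in this patch notes, it did not remove the variance on its own.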
diff --git a/share/test/run-perf.sh b/share/test/run-perf.sh
new file mode 100755
index 000000000..7aa8b0a1e
--- /dev/null
+++ b/share/test/run-perf.sh
@@ -0,0 +1,389 @@
+#!/bin/bash
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -ex
+
+function usage {
+  echo "`basename $0` --help"
+  echo "`basename $0` [--min] [--out-dir D] [--iters N] [--skip-one]\\"
+  echo "    [--only-combine] [--perf-args STRING] [-Dkey=value]* \\"
+  echo "    [--] [-Dkey=value]* branch_1[:name_1] .. [-Dkey=value]* branch_n[:name_n]"
+  echo
+  echo "Run a set of Perf.java trials so that their results can be compared."
+  echo "A 'trial' is N runs of Perf.java against the code as it"
+  echo "exists on a branch in git.  By comparing Perf.java output"
+  echo "generated by different branches in git, we can understand"
+  echo "the relative performance of those branches."
+  echo
+  echo "This script must be run in the lang/java/ipc directory of"
+  echo "the Avro source code, on a computer where Maven is installed"
+  echo "and the other build-prerequisites of Avro are in place.  This"
+  echo "script will do a 'mvn clean install' of Avro from within"
+  echo "the lang/java directory, before running tests."
+  echo
+  echo "The way Perf.java works is that it times an 'inner loop' that"
+  echo "reads or writes a large number of records (the exact number can"
+  echo "be controlled by a system property as described below).  This"
+  echo "inner loop is called a 'cycle.'  Perf.java runs a medium number of"
+  echo "these cycles, and outputs either the average or the minimum"
+  echo "of their running times.  This script runs Perf.java a small number"
+  echo "of times (controllable by the --iters flag), and takes either the"
+  echo "average or minimum of those.  The result of all this is the results"
+  echo "of a single 'trial.'"
+  echo 
+  echo "The basic model is that there is a 'baseline' trial plus any"
+  echo "number of 'treatment' trials.  The goal is to compare the"
+  echo "performance of each treatment against the baseline.  The main"
+  echo "output is written to the file 'summary.csv'.  This file contains"
+  echo "one line per performance test run by Perf.java.  Each row contains"
+  echo "a 'results' column for each trial, followed by a 'comparison' column"
+  echo "for each treatment trial.  The results column contains the average"
+  echo "(or minimum) of the runtimes of all cycles over all iterations of"
+  echo "the trial.  Each comparison column contains the difference between"
+  echo "the performance of the baseline and the treatment, as a percent"
+  echo "of the baseline.  Specifically, it contains"
+  echo "  100*(baseline-treatment)/baseline, i.e., positive numbers mean"
+  echo "we've seen a speedup."
+  echo
+  echo "By default, the running times of cycles are averaged together."
+  echo "The --min flag changes that to taking the minimum."
+  echo 
+  echo "By default, output is written to the current working directory."
+  echo "However, lots of intermediate files are generated, so it's recommended"
+  echo "that the --out-dir argument is used to redirect the output to"
+  echo "a different working directory."
+  echo 
+  echo "By default, the number of iterations in a trial is 4, but this can"
+  echo "be changed with the --iters flag."
+  echo 
+  echo "Perf.java takes a number of command-line arguments, and can be"
+  echo "influenced by system properties.  Command-line arguments can be"
+  echo "passed using the --perf-args flag.  When using this switch, pass"
+  echo "your Perf.java command-line arguments in a single string, even if"
+  echo "there is more than one of them.  You can set system properties"
+  echo "using the -Dkey=value switch, just as you would with Maven.  System"
+  echo "properties that come before the '--' switch and the first"
+  echo "branch are passed to all trials. System properties that come after"
+  echo "the '--' switch and/or first branch are passed to the branch that"
+  echo "follows them.  Commonly used system properties include:"
+  echo 
+  echo "     org.apache.avro.io.perf.count -- the number of elements"
+  echo "generated for the innermost loop of the performance test.  Defaults"
+  echo "to 250K.  Must be a multiple of 4."
+  echo
+  echo "     org.apache.avro.io.perf.cycles -- the number of times the"
+  echo "innermost loop is called within an invocation of Perf.java.  Defaults"
+  echo "to 800."
+  echo 
+  echo "     org.apache.avro.io.perf.use-direct -- use DirectBinaryEncoder instead"
+  echo "of BufferedBinaryEncoder for write tests.  It is slower, but performance-wise"
+  echo "it can be more consistent, which helps when trying to detect small performance"
+  echo "improvements."
+  echo
+  echo "     org.apache.avro.specific.use_custom_coders -- flag that turns on"
+  echo "the use of the custom-coder optimization in the SpecificRecord tests."
+  echo "Defaults to 'false'; set to 'true' to turn it on."
+  echo 
+  echo "Trials, as indicated, are branches in git.  The branch_i arguments"
+  echo "indicate which branches make up the trials.  The first of these"
+  echo "(branch_1) is considered the \"baseline\" trial: it's the trial"
+  echo "that all the others are compared against.  (However, if the --skip-one"
+  echo "flag is provided, the result from the first trial is ignored and the second"
+  echo "becomes the baseline.)"
+  echo
+  echo "Each trial has a name as well as a branch.  By default, the name of"
+  echo "the branch is the name of the trial, but an explicit name can be given"
+  echo "by suffixing the branch name with a trial name (e.g., 'foo:bar' will"
+  echo "use the branch 'foo' for a trial, but the trial will be named 'bar')."
+  echo "Trials must have unique names, so when multiple trials are run off the"
+  echo "same branch, explicit trial names must be used."
+  echo
+  echo "In addition to writing 'summary.csv', this script outputs other files,"
+  echo "allowing you to analyze the granular results of a test run.  The file"
+  echo "results.csv contains a row per test in Perf.java.  Each column"
+  echo "contains the result of a single run of Perf.java.  If N is the"
+  echo "number of iterations in a trial, then the first N columns are the"
+  echo "results from the individual iterations of the first trial, the"
+  echo "next N are the results from the second trial, and so forth.  In"
+  echo "addition, for each branch B being tested, there are multiple"
+  echo "files 'B_i.csv' for each iteration i in the trial.  These per-trial"
+  echo "files have two columns, the first being the name of the test, the"
+  echo "second being the result of that test.  Thus, 'results.csv' is the"
+  echo "result of joining these per-trial files on the test name, and"
+  echo "summary.csv takes the average (or minimum) of these per-trial"
+  echo "results, and adds the comparison columns."
+  echo
+  echo "If the --only-combine flag is given, then the script will assume"
+  echo "that the B_i files have been generated, and will simply join them"
+  echo "to compute results.csv and summary.csv.  This allows you to debug"
+  echo "the code that combines these files without having to wait around"
+  echo "for Perf.java to be run a bunch of times."
+}
+
+if [[ "$1" == "--help" ]]; then
+  usage
+  exit 0
+fi
+
+if [[ ! `pwd` =~ java/ipc ]]; then
+  echo "Must be run from lang/java/ipc"
+  echo "Type `basename $0` --help for help"
+  exit 1
+fi
+
+TEST="-c nt"
+EXTRA_CLI=""
+OUT="."
+SKIP_ONE="false"
+STATIC_SYSPROPS=()
+ITERS=4
+
+# DBG=echo
+
+function Perf_java {
+  local fname=$1
+  shift
+
+  if [[ "$DBG" != "" ]]; then
+    $DBG MAVEN_OPTS=-server mvn exec:java -Dexec.classpathScope=test \
+      -Dexec.mainClass=org.apache.avro.io.Perf ${STATIC_SYSPROPS[@]} \
+      -Dexec.args="${TEST} -o ${fname} ${EXTRA_CLI}" \
+      $@
+  else
+    mvn exec:java -Dexec.classpathScope=test \
+      -Dexec.mainClass=org.apache.avro.io.Perf ${STATIC_SYSPROPS[@]} \
+      -Dexec.args="${TEST} -o ${fname} ${EXTRA_CLI}" \
+      $@
+  fi
+}
+
+function run_trial {
+  local lastbranch=$1
+  local thisbranch=$2
+  local thistrialname=$3
+  shift 3
+
+  if [[ "$thisbranch" != "$lastbranch" ]]; then
+    $DBG git checkout $thisbranch
+    (cd ..; $DBG mvn clean && $DBG mvn -pl "avro,compiler,maven-plugin,ipc" install -DskipTests)
+  fi
+  for i in $(seq 1 ${ITERS}); do Perf_java ${OUT}/${thistrialname}${i}.csv $@; done
+}
+
+function run_trials {
+  local -a allprops=( )
+
+  while (( "$#" )); do
+    case "$1" in
+      --)
+        break;
+        ;;
+      *)
+        allprops+=( $1 )
+        shift
+        ;;
+    esac
+  done
+
+  local -a thisprops=( )
+  local lastbranch=""
+  local thisbranch
+  local thistrialname
+
+  while (( "$#" )); do
+    case "$1" in
+      --) # Ignore these
+        shift
+        ;;
+      -D*)
+        thisprops+=( $1 )
+        shift
+        ;;
+      *)
+        thisbranch=$1
+        thistrialname=$2
+        git rev-parse --verify $thisbranch
+        run_trial "$lastbranch" $thisbranch $thistrialname ${allprops[@]} ${thisprops[@]}
+        lastbranch=$thisbranch
+        thisprops=( )
+        shift 2
+        ;;
+    esac
+  done
+}
+
+function join_results {
+  pushd ${OUT}
+  local header="TestName"
+  for b in $@; do
+    for i in $(seq 1 ${ITERS}); do
+      header="${header},${b}${i}"
+    done
+  done
+#  echo $header > results.csv
+  if [[ "$SKIP_ONE" == "true" ]]; then shift; fi
+  cut -d , -f 1,2 ${1}1.csv | sort >> results.csv
+  if (( 1 < ITERS )); then
+    for i in $(seq 2 ${ITERS}); do
+      cut -d , -f 1,2 ${1}$i.csv | sort | join -t , results.csv - > tmp.csv
+      mv tmp.csv results.csv
+    done
+  fi
+  shift
+  for b in $@; do
+    for i in $(seq 1 ${ITERS}); do
+      cut -d , -f 1,2 ${b}$i.csv | sort | join -t , results.csv - > tmp.csv
+      mv tmp.csv results.csv
+    done
+  done
+  popd
+}
+
+AVG='BEGIN { RS=" "; } { s += $1; n += 1; } END { printf "%f", s/n; }'
+MIN='BEGIN { RS=" "; m = 10000000000; } { if ($1 < m) m = $1; } END { printf "%f", m; }'
+PERCENT='{ printf "%f", 100*($1-$2)/$1; }'
+
+function print_line {
+  local line=$1
+  shift
+  local awks
+  if [[ "$TEST" == "-c nt" ]]; then awks="$AVG"; else awks="$MIN"; fi
+
+  local -a results=( )
+  for t in ${trials[*]}; do
+    local result=""
+    for i in $(seq 1 $ITERS); do
+      result="$result $1"
+      shift
+    done
+    result=$(echo $result | awk "$awks")
+    results+=( $result )
+    line="${line},${result}"
+  done
+
+  local baseline=0
+  if [[ "$SKIP_ONE" == "true" ]]; then start=1; fi
+  for i in $(seq `expr ${baseline} + 1` `expr ${#trials[*]} - 1`); do
+    result=$(echo "${results[$baseline]} ${results[$i]}" | awk "$PERCENT")
+    line="${line},${result}"
+  done
+  echo "$line"
+}
+
+
+
+###
+### ACTUAL SCRIPT STARTS HERE
+###
+
+declare command="$0 $*"
+declare onlycombine="false"
+declare -a run_trials_args=( )
+declare -a trials=( )
+
+while (( "$#" )); do
+  case "$1" in
+    --help)
+      usage
+      exit
+      ;;
+    --min)
+      TEST="-c nm"
+      shift
+      ;;
+    --out-dir)
+      if [[ $OUT != "." ]]; then
+        echo "Cannot use --out-dir twice."
+        echo "Type `basename $0` --help for help"
+        exit 1
+      fi
+      OUT=$2
+      mkdir -p $OUT
+      shift 2
+      ;;
+    --iters)
+      ITERS=$2
+      shift 2
+      ;;
+    --only-combine)
+      onlycombine="true"
+      shift
+      ;;
+    --skip-one)
+      SKIP_ONE="true"
+      shift
+      ;;
+    --perf-args)
+      EXTRA_CLI=$2
+      shift 2
+      ;;
+    -D*)
+      if [[ ! $1 =~ ^-D[^\ =]+= ]]; then
+        echo "Bad system property: $1"
+        echo "Type `basename $0` --help for help"
+        exit 1
+      fi
+      run_trials_args+=( $1 )
+      shift
+      ;;
+    --)
+      run_trials_args+=( $1 )
+      shift
+      ;;
+    --*)
+      echo "Unknown switch: $1"
+      echo "Type `basename $0` --help for help"
+      exit 1
+      ;;
+    *)
+      if [[ "$1" =~ ^([^:]*):(.*) ]]; then
+        thisbranch=${BASH_REMATCH[1]}
+        thistrialname=${BASH_REMATCH[2]}
+      else
+        thisbranch=$1
+        thistrialname=$1
+      fi
+      if [[ "$thisbranch" == "" || "$thistrialname" == "" ]]; then
+        echo "Neither branch ($thisbranch) nor trial ($thistrialname) names may be empty"
+        echo "Type `basename $0` --help for help"
+        exit 1
+      fi
+      if [[ "${trials[@]}" =~ $thistrialname ]]; then
+        echo "Trial named '$thistrialname' is not unique"
+        echo "Type `basename $0` --help for help"
+        exit 1
+      fi
+      trials+=( "$thistrialname" )
+      run_trials_args+=( "--" "$thisbranch" "$thistrialname" )
+      shift
+      ;;
+  esac
+done
+
+# Document how the outputs were generated
+echo "$command" > $OUT/command.txt
+
+if [[ ${onlycombine} == "false" ]]; then
+  run_trials ${run_trials_args[@]}
+fi
+
+join_results ${trials[@]}
+
+cat $OUT/results.csv | while read line; do
+  fields=( $(echo $line | tr "," " ") )
+  print_line "${fields[@]}"
+done > $OUT/summary.csv
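
The reduction that run-perf.sh describes (aggregate the N iterations of each
trial, then compare every treatment against the baseline) can be sketched in
isolation as follows. This is only an illustration, not part of the patch:
the function name summarize_row and its arguments are hypothetical, the first
trial is assumed to be the baseline, and the comparison uses the same
expression as the script's PERCENT awk program, 100*(baseline-treatment)/baseline.

```shell
#!/bin/bash
# Illustrative sketch only; summarize_row is a hypothetical helper,
# not a function defined by run-perf.sh.
#
# Usage: summarize_row ITERS MODE ROW
#   ITERS  number of result columns per trial
#   MODE   "avg" or "min"
#   ROW    one results.csv-style line: TestName,v1,v2,...
summarize_row() {
  local iters=$1 mode=$2 row=$3
  echo "$row" | LC_ALL=C awk -F, -v n="$iters" -v mode="$mode" '
    {
      out = $1
      trials = (NF - 1) / n
      # Aggregate the n iteration results of each trial.
      for (t = 0; t < trials; t++) {
        sum = 0; lo = ""
        for (i = 1; i <= n; i++) {
          v = $(1 + t * n + i) + 0
          sum += v
          if (lo == "" || v < lo) lo = v
        }
        agg[t] = (mode == "min") ? lo : sum / n
        out = out sprintf(",%f", agg[t])
      }
      # Compare each treatment (t >= 1) with the baseline (t = 0).
      for (t = 1; t < trials; t++)
        out = out sprintf(",%f", 100 * (agg[0] - agg[t]) / agg[0])
      print out
    }'
}

# Two trials with two iterations each: baseline 100,120 and treatment 90,100.
summarize_row 2 avg "ReadInt,100,120,90,100"
# prints: ReadInt,110.000000,95.000000,13.636364
```

With running times as input, a positive comparison value means the treatment
took less time than the baseline.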


 


> Improve usability of Perf.java
> ------------------------------
>
>                 Key: AVRO-2269
>                 URL: https://issues.apache.org/jira/browse/AVRO-2269
>             Project: Apache Avro
>          Issue Type: Test
>          Components: java
>            Reporter: Raymie Stata
>            Assignee: Raymie Stata
>            Priority: Major
>
> The class {{org.apache.avro.io.Perf}} is Avro's performance test suite.
> This JIRA aims to make it easier to use.  Specifically:
> * Added a file {{performance-testing.html}} with guidance on how to use the 
> suite
> * Added script {{run-perf.sh}} that uses {{Perf}} to run structured
> experiments.
> * Added tests for performance of resolution of unchanged unions and 
> enumerations, which will be subject to future optimizations.
> * Tweaks to {{Perf}} for better experimentation (e.g., support for minimum as 
> well as average aggregation).
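
For readers trying out the script: each trial is given on the command line as
branch[:name], and trial names must be unique. A minimal sketch of that
convention follows. The helper names parse_trial_spec and is_unique_name are
hypothetical and do not appear in the patch, and the uniqueness check here is
an exact string comparison, which is an assumption about the intended
behaviour rather than a copy of the script's own check.

```shell
#!/bin/bash
# Illustrative sketch only; both helpers are hypothetical.

# Split "branch[:name]" into "branch name"; the name defaults to the branch.
parse_trial_spec() {
  local spec=$1 branch name
  if [[ "$spec" =~ ^([^:]*):(.*)$ ]]; then
    branch=${BASH_REMATCH[1]}
    name=${BASH_REMATCH[2]}
  else
    branch=$spec
    name=$spec
  fi
  if [[ -z "$branch" || -z "$name" ]]; then
    echo "empty branch or trial name in '$spec'" >&2
    return 1
  fi
  echo "$branch $name"
}

# Succeed only if the candidate name equals none of the names already seen.
is_unique_name() {
  local candidate=$1
  shift
  local seen
  for seen in "$@"; do
    if [[ "$seen" == "$candidate" ]]; then
      return 1
    fi
  done
  return 0
}

parse_trial_spec "foo:bar"   # prints: foo bar
parse_trial_spec "master"    # prints: master master
```

Running two trials off the same branch then requires explicit names, for
example master:run1 and master:run2.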


