[MINOR] Profile memory use in JMLC execution

This PR adds utilities to profile memory use during execution in JMLC. 
Specifically, the following changes were made:

1. Added options setStatistics() and gatherMemStats() to api.jmlc.Connection 
which control whether statistics should be gathered and, if so, whether 
memory use should be profiled. Also added a corresponding method to 
api.jmlc.PreparedScript to display the resulting statistics. Both options are 
false by default, and the following points apply only when running in JMLC 
mode with memory statistics enabled.
2. Modified utils.Statistics to track the memory used by distinct CacheBlock 
objects. At the conclusion of the script, the maximum memory use is reported. 
Memory use is computed by calling the object's getInMemorySize() method, which 
will generally be a slight overestimate of the actual memory used by the 
object.
3. If FINEGRAINED_STATISTICS is enabled, Statistics will also track the memory 
used by each named variable in a DML script and report it in a table, as is 
done for heavy hitter instructions. The goal is to detect unexpectedly large 
intermediate matrices (e.g. resulting from an outer product X %*% t(X)).
4. If FINEGRAINED_STATISTICS is enabled, Statistics will attempt to measure 
memory use more accurately by checking whether an object has been garbage 
collected. This is done by maintaining a soft reference to the object and 
periodically checking whether it has been cleared. This is enabled only when 
using fine-grained statistics since it introduces potentially non-trivial 
overhead by scanning a list of live objects. Note that simply using rmvar to 
remove a live variable results in a substantial underestimate of the memory 
used by the program, so that method is not used. When fine-grained statistics 
are not enabled, the resulting statistics will be an overestimate.
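The soft-reference bookkeeping described in point 4 can be sketched roughly as follows. This is a minimal standalone sketch, not SystemML's actual Statistics implementation; the class and method names (MemTracker, register, pruneCollected) are invented for illustration:

```java
import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: track the in-memory size of live objects via soft
// references, and subtract an entry's size once the garbage collector has
// cleared its referent.
public class MemTracker {
    private static final class Entry {
        final SoftReference<Object> ref;
        final long size;
        Entry(Object obj, long size) {
            this.ref = new SoftReference<>(obj);
            this.size = size;
        }
    }

    private final List<Entry> live = new ArrayList<>();
    private long currentBytes = 0;
    private long maxBytes = 0;

    // Called when a new cache block is created; size would come from
    // something like getInMemorySize() in practice.
    public void register(Object obj, long inMemorySize) {
        live.add(new Entry(obj, inMemorySize));
        currentBytes += inMemorySize;
        maxBytes = Math.max(maxBytes, currentBytes);
    }

    // Periodic scan: entries whose referent is null have been collected,
    // so their memory no longer counts toward the current footprint.
    // This linear scan is the overhead mentioned above.
    public void pruneCollected() {
        Iterator<Entry> it = live.iterator();
        while (it.hasNext()) {
            Entry e = it.next();
            if (e.ref.get() == null) {
                currentBytes -= e.size;
                it.remove();
            }
        }
    }

    public long getCurrentBytes() { return currentBytes; }
    public long getMaxBytes() { return maxBytes; }
}
```

Because a soft reference does not prevent collection, an object's memory stays counted exactly as long as it is actually reachable, which is why this is more accurate than subtracting at rmvar time.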

Potential impacts to performance: when fine-grained statistics are enabled, 
there will be some performance degradation from maintaining the set of live 
variables.

Potential improvements: related to the above, it would be nice to find a way 
of accurately tracking when an object is actually released without resorting 
to checking whether a soft reference has been cleared. It might also be useful 
to record the line number at which a "heavy hitting" object was created, to 
make debugging easier.
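For the first improvement, one possible direction (a sketch under assumed names, not part of this PR) is Java's PhantomReference combined with a ReferenceQueue: the JVM enqueues the reference when its referent becomes unreachable, so reclaimed sizes can be subtracted by draining the queue instead of repeatedly scanning a list of soft references:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: receive an explicit notification when a tracked
// object is reclaimed, rather than polling for cleared soft references.
public class ReclamationWatcher {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Map each phantom reference back to the size it represents; holding
    // the reference here also keeps it alive until we process it.
    private final Map<Reference<?>, Long> sizes = new HashMap<>();
    private long currentBytes = 0;

    public void watch(Object obj, long inMemorySize) {
        sizes.put(new PhantomReference<>(obj, queue), inMemorySize);
        currentBytes += inMemorySize;
    }

    // Drain the queue: each reference found here corresponds to an object
    // the collector has reclaimed, so its size is subtracted exactly once.
    public void drain() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Long size = sizes.remove(ref);
            if (size != null) {
                currentBytes -= size;
            }
        }
    }

    public long getCurrentBytes() { return currentBytes; }
}
```

The trade-off is that queue draining still has to happen somewhere (e.g. at instruction boundaries), but it is proportional to the number of reclaimed objects rather than the number of live ones.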

Closes #794.


Project: http://git-wip-us.apache.org/repos/asf/systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/systemml/commit/a2d3a721
Tree: http://git-wip-us.apache.org/repos/asf/systemml/tree/a2d3a721
Diff: http://git-wip-us.apache.org/repos/asf/systemml/diff/a2d3a721

Branch: refs/heads/gh-pages
Commit: a2d3a721d05851ed03cb3fa5320075d7872b7ed0
Parents: af4cf76
Author: Anthony Thomas <[email protected]>
Authored: Fri Jul 6 11:10:17 2018 -0700
Committer: Niketan Pansare <[email protected]>
Committed: Fri Jul 6 11:23:36 2018 -0700

----------------------------------------------------------------------
 jmlc.md | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/systemml/blob/a2d3a721/jmlc.md
----------------------------------------------------------------------
diff --git a/jmlc.md b/jmlc.md
index 2183700..a703d01 100644
--- a/jmlc.md
+++ b/jmlc.md
@@ -49,6 +49,18 @@ of SystemML's distributed modes, such as Spark batch mode or Hadoop batch mode,
 distributed computing capabilities. JMLC offers embeddability at the cost of performance, so its use is
 dependent on the nature of the business use case being addressed.
 
+## Statistics
+
+JMLC can be configured to gather runtime statistics, as in the MLContext API, by calling Connection's `setStatistics()`
+method with a value of `true`. JMLC can also be configured to gather statistics on the memory used by matrices and
+frames in the DML script. To enable collection of memory statistics, call Connection's `gatherMemStats()` method
+with a value of `true`. When fine-grained statistics are enabled in `SystemML.conf`, JMLC will also report the variables
+in the DML script which used the most memory. By default, the memory use reported will be an overestimate of the actual
+memory required to run the program. When fine-grained statistics are enabled, JMLC will gather more accurate statistics
+by keeping track of garbage collection events and reducing the memory estimate accordingly. The most accurate way to
+determine the memory required by a script is to run the script in a single thread and enable fine-grained statistics.
+
+An example showing how to enable statistics in JMLC is presented in the section below.
 
 ---
 
@@ -114,11 +126,19 @@ the resulting `"predicted_y"` matrix. We repeat this process. When done, we clos
 
         // obtain connection to SystemML
         Connection conn = new Connection();
+
+        // turn on gathering of runtime statistics and memory use
+        conn.setStatistics(true);
+        conn.gatherMemStats(true);
 
         // read in and precompile DML script, registering inputs and outputs
         String dml = conn.readScript("scoring-example.dml");
         PreparedScript script = conn.prepareScript(dml, new String[] { "W", "X" }, new String[] { "predicted_y" }, false);
- 
+
+        // obtain the runtime plan generated by SystemML
+        String plan = script.explain();
+        System.out.println(plan);
+
         double[][] mtx = matrix(4, 3, new double[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 });
         double[][] result = null;
 
@@ -127,6 +147,10 @@ the resulting `"predicted_y"` matrix. We repeat this process. When done, we clos
         script.setMatrix("X", randomMatrix(3, 3, -1, 1, 0.7));
         result = script.executeScript().getMatrix("predicted_y");
         displayMatrix(result);
+
+        // print the resulting runtime statistics
+        String stats = script.statistics();
+        System.out.println(stats);
  
         script.setMatrix("W", mtx);
         script.setMatrix("X", randomMatrix(3, 3, -1, 1, 0.7));
