Well, these are all very interesting questions - let's discuss them separately:
1) Average/worst-case formulas in SystemML: The referenced formulas in our code base follow fairly simple and intuitive patterns.

* For the average-case estimate, we assume non-zeros are uniformly distributed and hence compute the output sparsity as the probability that an output cell is non-zero, i.e., 1 minus the probability that it is zero. For a single multiply-add, the probability of a zero result is (1 - sp1*sp2), and hence (1 - sp1*sp2)^k for the entire dot product over the common dimension of length k (assuming the output is zero if and only if all individual products are zero). This gives an output sparsity of 1 - (1 - sp1*sp2)^k. Due to the assumption of uniformity, the output dimensions are irrelevant here.

* For the worst-case estimates as actually used within SystemML's memory estimates (in order to guarantee that we do not run out of memory), we instead assume a worst-case structure, namely a column vector of non-zeros in the left-hand side aligned with a row vector of non-zeros in the right-hand side. This gives us min(1, nnz1/m) * min(1, nnz2/n), which is obviously very conservative.

We introduced these formulas in our optimizer paper [1], and other groups have either used (and extended) or at least referenced exactly the same formulas [2, 3].
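To make these two estimates concrete, here is a minimal, self-contained Java sketch of the formulas as described above (class and method names are illustrative only, not the actual code behind the OptimizerUtils link Nakul posted):

// Illustrative sketch of the two matrix-multiply sparsity estimates
// discussed in 1); not the actual SystemML OptimizerUtils code.
public class SparsityEstimateSketch {

  // Average case: non-zeros uniformly distributed. A single product is
  // non-zero with probability sp1*sp2, so an output cell stays zero with
  // probability (1 - sp1*sp2)^k over the common dimension of length k.
  static double estimateAvgCase(double sp1, double sp2, long k) {
    return 1 - Math.pow(1 - sp1 * sp2, k);
  }

  // Worst case: a column vector of non-zeros in the LHS aligned with a
  // row vector of non-zeros in the RHS, for an m x n output.
  static double estimateWorstCase(long nnz1, long nnz2, long m, long n) {
    return Math.min(1.0, (double) nnz1 / m) * Math.min(1.0, (double) nnz2 / n);
  }

  public static void main(String[] args) {
    long m = 1000, k = 500, n = 2000;   // (m x k) %*% (k x n)
    double sp1 = 0.01, sp2 = 0.05;
    long nnz1 = (long) (sp1 * m * k), nnz2 = (long) (sp2 * k * n);
    System.out.println(estimateAvgCase(sp1, sp2, k));        // ~0.22
    System.out.println(estimateWorstCase(nnz1, nnz2, m, n)); // 1.0 (capped)
  }
}

The toy numbers in main() already show how conservative the worst-case estimate is: the average-case estimate is about 0.22, while the worst-case estimate saturates at 1.0 (fully dense).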
2) Relevance in practice: There are three reasons why we still rely on such a very conservative estimate. First, it allows us to guarantee memory requirements, which also includes potential transitions from dense to sparse. Second, all our matrix multiplications, except ultra-sparse ones, use dense outputs anyway to allow efficient cache-conscious implementations. Note, however, that the output estimate of an operation implicitly affects the memory estimates of any data-dependent operations because it becomes part of their input memory estimates. Third, whenever we cannot estimate the output sparsity and it matters because the dimensions are large too, subsequent operations are flagged for dynamic recompilation. During dynamic recompilation, the exact number of non-zeros is known and used to recompile runtime plans. There are even a couple of rewrites that explicitly split HOP DAGs after permutation matrix multiplies to create recompilation hooks before any subsequent operations. In practice, dynamic recompilation is able to mitigate compiler shortcomings such as these conservative estimates very well.

3) Sketches: Overall, sketching is indeed a very common approach to remove the assumption of uniformly distributed non-zeros and thus arrive at better estimates for skewed real-world matrices. The hashing technique from the paper Dylan mentioned is a very compelling example of that (when I first read this paper a while ago, it reminded me of KMV synopses and how they are propagated). [2] and [3] also used forms of sketches, in terms of systematic density maps and sampling of row/column pairs, for chains of matrix multiplications. A related challenge is the estimation of sparsity for intermediates of an entire matrix multiplication chain: in order to evaluate alternative execution plans, we need to recursively compute sketches for intermediates that cannot be sampled directly. If anybody is interested in that, let me know and we can have a more detailed discussion.

Regards,
Matthias

[1] Matthias Boehm, Douglas R. Burdick, Alexandre V. Evfimievski, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Shirish Tatikonda, Yuanyuan Tian: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull. 37(3): 52-62 (2014), http://sites.computer.org/debull/A14sept/p52.pdf, page 7
[2] David Kernert, Frank Köhler, Wolfgang Lehner: SpMacho - Optimizing Sparse Linear Algebra Expressions with Probabilistic Density Estimation. EDBT 2015: 289-300, http://openproceedings.org/2015/conf/edbt/paper-117.pdf, page 6
[3] Yongyang Yu, MingJie Tang, Walid G. Aref, Qutaibah M. Malluhi, Mostafa M. Abbas, Mourad Ouzzani: In-Memory Distributed Matrix Computation Processing and Optimization. ICDE 2017: 1047-1058, page 3

On Tue, Jun 20, 2017 at 2:46 PM, Dylan Hutchison <[email protected]> wrote:

> Hi Nakul,
>
> The paper I referenced uses the term "Boolean matrix", but really it applies to any matrix product in which two properties hold:
>
> 1. Zero product property: a * b = 0 ==> a = 0 or b = 0.
> 2. Few, if any, "sums to zero": there should be few cases of nonzero partial products summing to zero.
>
> Property #1 holds for standard arithmetic * over the reals. If we multiply matrices with positive entries, then there are no sums to zero. Otherwise, if there are many positive and negative entries, the number of nonzero entries in the result may be fewer than the paper's algorithm would predict.
>
> Specifically, under these two properties, nnz( A * B ) = nnz( ||A||_0 * ||B||_0 ), where ||A||_0 is the "zero norm", which maps nonzero entries to 1 and zero entries to 0.
>
> I don't know about the origin of the current heuristic for SystemML.
>
> Cheers, Dylan
>
> On Tue, Jun 20, 2017 at 2:00 PM, Nakul Jindal <[email protected]> wrote:
>
> > Thank you Dylan. The paper seems interesting. The abstract reads as follows: "We consider the problem of doing fast and reliable estimation of the number z of non-zero entries in a sparse boolean matrix product"...
> > I think they may be doing this for boolean matrices. The code that I pointed to in SystemML is used for all matrices.
> > I am not sure if and how their technique translates to non-boolean matrices.
> >
> > Also, I am asking for an explanation of what is implemented as opposed to what can be.
> > I was wondering if the neat-looking formula was taken from some literature somewhere?
> >
> > -Nakul
> >
> > On Tue, Jun 20, 2017 at 1:42 PM, Dylan Hutchison <[email protected]> wrote:
> >
> > > In case it is helpful, this 2010 paper
> > > <https://www.semanticscholar.org/paper/Better-Size-Estimation-for-Sparse-Matrix-Products-Amossen-Campagna/52f70406bb4d887e1b79af81a746184c5723848a>
> > > is my favorite for estimating the nnz of a matrix product with high confidence. The algorithm is much more involved than the simple SystemML calculation, because it looks at samples from the actual data.
> > >
> > > Here are some notes I have on the paper:
> > >
> > > There is a sketch algorithm [2] that gives a (1 + ε) approximation z̃ to z for any ε > 4 / (z^.25) in O(m) time and with a bound on the number of I/Os. In relational algebra, this is estimating |π_{ik}( A(i, j) ⋈ B(j, k) )|. The idea behind the algorithm is finding, for some tuning parameter k that varies with z, the k-smallest value of a hash function h(i; k). The larger this value is, the more i's and k's match for a given j. Repeat this for all j's. Different (i, k)s contribute to different entries in the result matrix and have different hash values. A sketch for this algorithm can be incrementally maintained via independent samples of the i's and k's. Computing the z̃ estimate from the sample takes o(n) time for large enough z. The larger z is, the smaller a sample size we need. (Need large samples for small z.) (Everything above assumes the zero product property and zero-sum-freeness.)
> > >
> > > On Tue, Jun 20, 2017 at 12:06 PM, Nakul Jindal <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > There is code in SystemML to "estimate" the output sparsity of a matrix multiplication operation between two sparse matrices.
> > > > Is there a citation for this? If not, have we done any tests to figure out how good these estimates are?
> > > > https://github.com/nakul02/systemml/blob/df8d4a63d8d09cae94b6ca2634e31da554302c72/src/main/java/org/apache/sysml/hops/OptimizerUtils.java#L944-L953
> > > >
> > > > Thanks,
> > > > Nakul
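As a tiny, self-contained illustration of the identity Dylan describes above - nnz(A * B) = nnz(||A||_0 * ||B||_0) when the zero-product property holds and no partial products cancel - here is a minimal Java sketch; the matrices and the class name are made up for the example:

// Illustration only (not SystemML code): with no cancellation, the nnz of
// A*B equals the nnz of the product of the 0/1 sparsity patterns.
public class ZeroNormIdentity {

  static double[][] multiply(double[][] a, double[][] b) {
    int m = a.length, k = b.length, n = b[0].length;
    double[][] c = new double[m][n];
    for (int i = 0; i < m; i++)
      for (int j = 0; j < k; j++)
        for (int l = 0; l < n; l++)
          c[i][l] += a[i][j] * b[j][l];
    return c;
  }

  // The "zero norm": map non-zero entries to 1, zero entries to 0.
  static double[][] zeroNorm(double[][] a) {
    double[][] p = new double[a.length][a[0].length];
    for (int i = 0; i < a.length; i++)
      for (int j = 0; j < a[0].length; j++)
        p[i][j] = (a[i][j] != 0) ? 1 : 0;
    return p;
  }

  static long nnz(double[][] a) {
    long count = 0;
    for (double[] row : a)
      for (double v : row)
        if (v != 0) count++;
    return count;
  }

  public static void main(String[] args) {
    // Positive entries only, so there is no cancellation.
    double[][] a = {{2, 0, 0}, {0, 3, 0}};
    double[][] b = {{0, 1, 0, 0}, {5, 0, 0, 0}, {0, 0, 7, 0}};
    System.out.println(nnz(multiply(a, b)));                     // 2
    System.out.println(nnz(multiply(zeroNorm(a), zeroNorm(b)))); // 2
  }
}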
