Author: apalumbo
Date: Wed Apr 15 23:51:14 2015
New Revision: 1673980
URL: http://svn.apache.org/r1673980
Log:
add out-of-core DSL reference page
Added:
mahout/site/mahout_cms/trunk/content/users/environment/out-of-core-reference.mdtext
Added:
mahout/site/mahout_cms/trunk/content/users/environment/out-of-core-reference.mdtext
URL:
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/environment/out-of-core-reference.mdtext?rev=1673980&view=auto
==============================================================================
---
mahout/site/mahout_cms/trunk/content/users/environment/out-of-core-reference.mdtext
(added)
+++
mahout/site/mahout_cms/trunk/content/users/environment/out-of-core-reference.mdtext
Wed Apr 15 23:51:14 2015
@@ -0,0 +1,308 @@
+# Mahout-Samsara's Out-Of-Core Linear Algebra DSL Reference
+
+**Note: this page is meant only as a quick reference to Mahout-Samsara's
R-Like DSL semantics. For more information, including information on
Mahout-Samsara's Algebraic Optimizer, please see: [Mahout Scala Bindings and
Mahout Spark Bindings for Linear Algebra
Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf).**
+
+The subjects of this reference are solely applicable to Mahout-Samsara's
**DRM** (distributed row matrix).
+
+In this reference, DRMs are denoted as e.g. `A`, and in-core matrices as
e.g. `inCoreA`.
+
+#### Imports
+
+The following imports are used to enable seamless in-core and distributed
algebraic DSL operations:
+
+ import org.apache.mahout.math._
+ import scalabindings._
+ import RLikeOps._
+ import drm._
+ import RLikeDRMOps._
+
+If working with mixed Scala/Java code:
+
+ import collection._
+ import JavaConversions._
+
+If you are working with Mahout-Samsara's Spark-specific operations e.g. for
context creation:
+
+ import org.apache.mahout.sparkbindings._
+
+The Mahout shell does all of these imports automatically.
+
+
+#### DRM Persistence operators
+
+**Mahout-Samsara's DRM persistence to HDFS is compatible with all
Mahout-MapReduce algorithms such as seq2sparse.**
+
+
+Loading a DRM from (HD)FS:
+
+ drmDfsRead(path = hdfsPath)
+
+Parallelizing from an in-core matrix:
+
+ val inCoreA = dense((1, 2, 3), (3, 4, 5))
+ val A = drmParallelize(inCoreA)
+
+Creating an empty DRM:
+
+ val A = drmParallelizeEmpty(100, 50)
+
+Collecting to the driver's JVM in-core:
+
+ val inCoreA = A.collect
+
+**Warning: The collection of distributed matrices happens implicitly whenever
conversion to an in-core (o.a.m.math.Matrix) type is required. E.g.:**
+
+ val inCoreA: Matrix = ...
+ val drmB: DrmLike[Int] = ...
+ val inCoreC: Matrix = inCoreA %*% drmB
+
+**implies (inCoreA %*% drmB).collect**
+
+Collecting to (HD)FS as a Mahout DRM-formatted file:
+
+ A.dfsWrite(path = hdfsPath)
+
+#### Logical algebraic operators on DRM matrices:
+
+A logical set of operators is defined for distributed matrices as a subset
of those defined for in-core matrices. In particular, since all distributed
matrices are immutable, there are no assignment operators (e.g. **A += B**).
+
+*Note: for information on Mahout-Samsara's Algebraic Optimizer, and on the
translation from logical operations to a physical plan for the back end,
please see: [Mahout Scala Bindings and Mahout Spark Bindings for Linear
Algebra
Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf).*
+
+
+Cache a DRM and trigger an optimized physical plan:
+
+ drmA.checkpoint(CacheHint.MEMORY_AND_DISK)
+
+Other valid caching instructions:
+
+ drmA.checkpoint(CacheHint.NONE)
+ drmA.checkpoint(CacheHint.DISK_ONLY)
+ drmA.checkpoint(CacheHint.DISK_ONLY_2)
+ drmA.checkpoint(CacheHint.MEMORY_ONLY)
+ drmA.checkpoint(CacheHint.MEMORY_ONLY_2)
+ drmA.checkpoint(CacheHint.MEMORY_ONLY_SER)
+ drmA.checkpoint(CacheHint.MEMORY_ONLY_SER_2)
+ drmA.checkpoint(CacheHint.MEMORY_AND_DISK_2)
+ drmA.checkpoint(CacheHint.MEMORY_AND_DISK_SER)
+ drmA.checkpoint(CacheHint.MEMORY_AND_DISK_SER_2)
+
+*Note: Logical DRM operations are lazily computed. Currently the actual
computations and optional caching will be triggered by dfsWrite(...),
collect(...) and blockify(...).*
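+
+The lazy behaviour described in the note can be pictured with plain Scala
collections. The sketch below is only an analogy and is not Mahout code: a
collection view stands in for a logical DRM expression, `toList` stands in for
an action such as `collect`, and the names `LazyPlanDemo`, `evaluations`,
`plan` and `run` are made up for illustration.
+

```scala
object LazyPlanDemo {
  // Counts how many times the element-wise function actually runs.
  var evaluations = 0

  // Building the view does no work; it only records what to compute.
  val plan = Seq(1.0, 2.0, 3.0).view.map { x =>
    evaluations += 1
    x + 1.0
  }

  // Forcing the view plays the role of an action such as collect.
  def run(): List[Double] = plan.toList
}
```

+
+In this analogy, every call to `run()` re-evaluates the function for each
element, because nothing is cached between calls. For DRMs, `checkpoint(...)`
with a caching hint is the mechanism this page describes for keeping a result
around.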
+
+
+
+Transposition:
+
+ A.t
+
+Elementwise addition *(Matrices of identical geometry and row key types)*:
+
+ A + B
+
+Elementwise subtraction *(Matrices of identical geometry and row key types)*:
+
+ A - B
+
+Elementwise multiplication (Hadamard) *(Matrices of identical geometry and row
key types)*:
+
+ A * B
+
+Elementwise division *(Matrices of identical geometry and row key types)*:
+
+ A / B
+
+**Elementwise operations involving one in-core argument (Int-keyed DRMs only)**:
+
+ A + inCoreB
+ A - inCoreB
+ A * inCoreB
+ A / inCoreB
+ A :+ inCoreB
+ A :- inCoreB
+ A :* inCoreB
+ A :/ inCoreB
+ inCoreA +: B
+ inCoreA -: B
+ inCoreA *: B
+ inCoreA /: B
+
+*Note: Spark associativity change (e.g. `A *: inCoreB` means `B.leftMultiply(A)`,
same as when both arguments are in core). Whenever operator arguments include
both in-core and out-of-core arguments, the operator can only be associated
with the out-of-core (DRM) argument to support the distributed implementation.*
+
+**Matrix-matrix multiplication %*%**:
+
+`\(\mathbf{M}=\mathbf{AB}\)`
+
+ A %*% B
+ A %*% inCoreB
+ A %*% inCoreDiagonal
+ A %*%: B
+
+
+*Note: same as above, whenever operator arguments include both in-core and
out-of-core arguments, the operator can only be associated with the out-of-core
(DRM) argument to support the distributed implementation.*
+
+**Matrix-vector multiplication %*%**
+
+Currently we support a right-multiply product of a DRM and an in-core
Vector (`\(\mathbf{Ax}\)`), resulting in a single-column DRM, which can then be
collected to the front end (usually the desired outcome):
+
+ val Ax = A %*% x
+ val inCoreX = Ax.collect(::, 0)
+
+
+**Matrix-scalar +,-,*,/**
+
+Elementwise operations of every matrix element and a scalar:
+
+ A + 5.0
+ A - 5.0
+ A :- 5.0
+ 5.0 -: A
+ A * 5.0
+ A / 5.0
+ 5.0 /: A
+
+Note that `5.0 -: A` means `\(m_{ij} = 5 - a_{ij}\)` and `5.0 /: A` means
`\(m_{ij} = \frac{5}{a_{ij}}\)` for all elements of the result.
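+
+The reversed forms rely on a general Scala rule: an operator whose name ends in
a colon binds to its right operand, so `5.0 -: A` is the method call
`A.-:(5.0)`. The sketch below shows the rule with a small stand-in class;
`Vec` and `RightAssocDemo` are hypothetical names and are not part of Mahout.
+

```scala
// Hypothetical stand-in for a matrix-like type; not a Mahout class.
final case class Vec(values: Vector[Double]) {
  // Left-associative: `v - s` subtracts the scalar from every element.
  def -(s: Double): Vec = Vec(values.map(_ - s))

  // Right-associative because the name ends in ':'.
  // `s -: v` is parsed as `v.-:(s)` and computes s - element.
  def -:(s: Double): Vec = Vec(values.map(s - _))

  // `s /: v` is parsed as `v./:(s)` and computes s / element.
  def /:(s: Double): Vec = Vec(values.map(s / _))
}

object RightAssocDemo {
  val a = Vec(Vector(1.0, 2.0, 4.0))
  val minus = a - 5.0           // element - 5
  val reversedMinus = 5.0 -: a  // 5 - element
  val reversedDiv = 8.0 /: a    // 8 / element
}
```

+
+Because the method is always resolved on the right operand, the scalar can
appear on the left without needing any operator defined on `Double` itself.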
+
+
+#### Slicing
+
+General slice:
+
+ A(100 to 200, 100 to 200)
+
+Horizontal Block:
+
+ A(::, 100 to 200)
+
+Vertical Block:
+
+ A(100 to 200, ::)
+
+*Note: if the row range is not all-range (::), then the DRM must be `Int`-keyed.
General case row slicing is not supported by DRMs with key types other than
`Int`*.
+
+
+#### Stitching
+
+Stitch side by side (cbind R semantics):
+
+ val drmAnextToB = drmA cbind drmB
+
+Stitching side by side (Scala):
+
+ val drmAnextToB = drmA.cbind(drmB)
+
+Analogously, vertical concatenation is available via **rbind**.
+
+#### Custom pipelines on blocks
+
+Internally, Mahout-Samsara's DRM is represented as a distributed set of
vertical (Key, Block) tuples.
+
+**drm.mapBlock(...)**:
+
+The DRM operator `mapBlock` provides transformational access to the distributed
vertical blockified tuples of a matrix (Row-Keys, Vertical-Matrix-Block).
+
+Using `mapBlock` to add 1.0 to a DRM:
+
+ val inCoreA = dense((1, 2, 3), (2, 3, 4), (3, 4, 5))
+ val drmA = drmParallelize(inCoreA)
+ val B = drmA.mapBlock() {
+ case (keys, block) => keys -> (block += 1.0)
+ }
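+
+To make the (keys, block) shape concrete, the following plain-Scala sketch
models a DRM as a local list of vertical blocks and applies the same "add 1.0"
function to each block. It is an illustration only: `MapBlockModel`, `Block`,
`ToyDrm` and this `mapBlock` are hypothetical stand-ins, and Mahout's real
`mapBlock` operates on distributed matrix blocks rather than local arrays.
+

```scala
object MapBlockModel {
  // One vertical block: the row keys plus the rows they label.
  type Block = (Array[Int], Array[Array[Double]])

  // Toy "DRM": a local list of blocks instead of a distributed set.
  final case class ToyDrm(blocks: List[Block]) {
    // Apply a block-level function to every (keys, block) tuple.
    def mapBlock(f: Block => Block): ToyDrm = ToyDrm(blocks.map(f))

    // Gather all rows back into one local array, in block order.
    def collect: Array[Array[Double]] = blocks.flatMap(_._2).toArray
  }

  val drmA = ToyDrm(List(
    (Array(0, 1), Array(Array(1.0, 2.0, 3.0), Array(2.0, 3.0, 4.0))),
    (Array(2), Array(Array(3.0, 4.0, 5.0)))
  ))

  // Same shape as the Mahout example: keep the keys, transform the block.
  val drmB = drmA.mapBlock {
    case (keys, block) => keys -> block.map(_.map(_ + 1.0))
  }
}
```

+
+The function passed to `mapBlock` sees one block at a time and never the whole
matrix, which is why the pattern returns the keys unchanged alongside the
transformed block.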
+
+#### Broadcasting Vectors and matrices to closures
+
+Generally we can create and use one-way closure attributes to be used on the
back end.
+
+Scalar matrix multiplication:
+
+ val factor: Int = 15
+ val drm2 = drm1.mapBlock() {
+ case (keys, block) => block *= factor
+ keys -> block
+ }
+
+**Closure attributes must be java-serializable. Currently Mahout's in-core
Vectors and Matrices are not java-serializable, and must be broadcast to the
closure using `drmBroadcast(...)`**:
+
+ val v: Vector = ...
+ val bcastV = drmBroadcast(v)
+ val drm2 = drm1.mapBlock() {
+ case (keys, block) =>
+ for(row <- 0 until block.nrow) block(row, ::) -= bcastV
+ keys -> block
+ }
+
+#### Computations providing ad-hoc summaries
+
+
+Matrix cardinality:
+
+ drmA.nrow
+ drmA.ncol
+
+*Note: depending on the stage of optimization, these may trigger a
computational action, i.e. if one calls `nrow()` n times, then the back end
will actually recompute `nrow` n times.*
+
+Means and sums:
+
+ drmA.colSums
+ drmA.colMeans
+ drmA.rowSums
+ drmA.rowMeans
+
+
+*Note: These will always trigger a computational action. I.e. if one calls
`colSums()` n times, then the back end will actually recompute `colSums` n
times.*
+
+#### Distributed Matrix Decompositions
+
+To import the decomposition package:
+
+ import org.apache.mahout.math._
+ import decompositions._
+
+Distributed thin QR:
+
+ val (drmQ, incoreR) = dqrThin(drmA)
+
+Distributed SSVD:
+
+ val (drmU, drmV, s) = dssvd(drmA, k = 40, q = 1)
+
+Distributed SPCA:
+
+ val (drmU, drmV, s) = dspca(drmA, k = 30, q = 1)
+
+Distributed regularized ALS:
+
+ val (drmU, drmV, i) = dals(drmA,
+ k = 50,
+ lambda = 0.0,
+ maxIterations = 10,
+ convergenceThreshold = 0.10)
+
+#### Adjusting parallelism of computations
+
+Set the minimum parallelism to 100 for computations on `drmA`:
+
+ drmA.par(min = 100)
+
+Set the exact parallelism to 100 for computations on `drmA`:
+
+ drmA.par(exact = 100)
+
+
+Set the engine-specific automatic parallelism adjustment for computations on
`drmA`:
+
+ drmA.par(auto = true)
+
+#### Retrieving the engine-specific data structure backing the DRM:
+
+**A Spark RDD:**
+
+ val myRDD = drmA.checkpoint().rdd
+
+**An H2O Frame and Key Vec:**
+
+ val myFrame = drmA.frame
+ val myKeys = drmA.keys
+
+
+For more information, including information on Mahout-Samsara's Algebraic
Optimizer and in-core linear algebra bindings, see: [Mahout Scala Bindings and
Mahout Spark Bindings for Linear Algebra
Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf)
+
+
+
+
+
+
+