Matthias Boehm created SYSTEMML-2170:
----------------------------------------
Summary: Remote parfor fails on reading ultra-sparse matrix with
dims > 2G
Key: SYSTEMML-2170
URL: https://issues.apache.org/jira/browse/SYSTEMML-2170
Project: SystemML
Issue Type: Bug
Reporter: Matthias Boehm
The parfor optimizer has a rewrite that selects the remote Spark execution type
even if the original program contains Spark operations, provided these
operations fit into the memory budget of the executors. However, this rewrite
does not check that the matrix dimensions are valid integer dimensions, and
hence execution fails with:
{code}
Caused by: org.apache.sysml.runtime.DMLRuntimeException: Matrix dimensions too large for CP runtime: 3 x 5129281161
    at org.apache.sysml.runtime.io.MatrixReader.createOutputMatrixBlock(MatrixReader.java:80)
    at org.apache.sysml.runtime.io.ReaderBinaryBlockParallel.readMatrixFromHDFS(ReaderBinaryBlockParallel.java:59)
    at org.apache.sysml.runtime.util.DataConverter.readMatrixFromHDFS(DataConverter.java:207)
{code}
Here is the related optimizer output:
{code}
----------------------------
EXPLAIN OPT TREE (type=ABSTRACT_PLAN, size=22)
----------------------------
--PARFOR, exec=CP, k=16, dp=NONE, tp=FIXED, rm=LOCAL_AUTOMATIC
----GENERIC (lines 122-126), exec=CP, k=1
------lix, exec=CP, k=1
------b(-), exec=CP, k=1
------b(*), exec=CP, k=1
------r(t), exec=CP, k=16
------ba(+*), exec=CP, k=16
------rix, exec=CP, k=1
------r(rshape), exec=CP, k=16
------ba(+*), exec=CP, k=16
------r(rshape), exec=CP, k=16
------rix, exec=CP, k=1
------r(rshape), exec=SPARK, k=1
------rix, exec=SPARK, k=1
------b(/), exec=CP, k=1
------u(exp), exec=CP, k=16
------b(-), exec=CP, k=1
------rix, exec=CP, k=1
------ua(maxRC), exec=CP, k=16
------ua(+RC), exec=CP, k=16
------b(*), exec=CP, k=1
------ua(+RC), exec=CP, k=16
----------------------------
18/03/06 23:17:33 DEBUG Optimizer: --- RULEBASED OPTIMIZER -------
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: Optimize w/ max_mem=24271MB/4638MB/4638MB, max_k=16/144/144).
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: Optimize w/ SparkClusterConfig:
-- legacyVersion = false (2.2.0)
-- confOnly = true
-- numExecutors = 6
-- defaultPar = 144
-- memExecutor = 69478645760
-- memDataMinFrac = 0.5
-- memDataMaxFrac = 0.6
-- memBroadcastFrac = 0.21
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: estimated mem (serial exec) M=109MB
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set data partitioner' - result=NONE ()
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'remove unnecessary compare matrix' - result=false ()
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set result partitioning' - result=false
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: estimated new mem (serial exec) M=109MB
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: estimated new mem (serial exec, all CP) M=109MB
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: estimated new mem (cond partitioning) M=109MB
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set execution strategy' - result=REMOTE_SPARK (recompile=true)
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set operation exec type CP' - result=2
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'enable data colocation' - result=false
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set partition replication factor' - result=false
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set export replication factor' - result=true (3)
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set degree of parallelism' - result=(see EXPLAIN)
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set task partitioner' - result=STATIC
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set fused data partitioning and execution' - result=false
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set transpose sparse vector operations' - result=false
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set in-place result indexing' - result=true ([delta_b_softmax], M=160MB)
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'disable CP caching' - result=false (M=160MB)
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set result merge' - result=LOCAL_MEM
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'set recompile memory budget' - result=24271MB
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'remove recursive parfor' - result=0/0
18/03/06 23:17:33 DEBUG Optimizer: RULEBASED OPT: rewrite 'remove unnecessary parfor' - result=0
18/03/06 23:17:33 DEBUG OptimizationWrapper: ParFOR Opt: Optimized plan (after optimization):
----------------------------
EXPLAIN OPT TREE (type=ABSTRACT_PLAN, size=22)
----------------------------
--PARFOR, exec=SPARK, k=3, dp=NONE, tp=STATIC, rm=LOCAL_MEM
----GENERIC (lines 122-126), exec=CP, k=1
------lix, exec=CP, k=1
------b(-), exec=CP, k=1
------b(*), exec=CP, k=1
------r(t), exec=CP, k=1
------ba(+*), exec=CP, k=1
------rix, exec=CP, k=1
------r(rshape), exec=CP, k=1
------ba(+*), exec=CP, k=1
------r(rshape), exec=CP, k=1
------rix, exec=CP, k=1
------r(rshape), exec=CP, k=1
------rix, exec=CP, k=1
------b(/), exec=CP, k=1
------u(exp), exec=CP, k=1
------b(-), exec=CP, k=1
------rix, exec=CP, k=1
------ua(maxRC), exec=CP, k=1
------ua(+RC), exec=CP, k=1
------b(*), exec=CP, k=1
------ua(+RC), exec=CP, k=1
----------------------------
{code}
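The missing precondition could look roughly like the following sketch. This is
only an illustration and not the actual SystemML code: the class CpDimsCheck
and the methods hasValidCpDims and canForceCp are hypothetical names, and the
memory values passed in main (109MB estimate, 4638MB budget) are taken from the
optimizer output above purely as example inputs. The idea is that an operation
should be switched from Spark to CP only if it fits into the memory budget and
both of its dimensions fit into a Java int.

```java
// Hypothetical sketch; these names are illustrative and are not SystemML APIs.
final class CpDimsCheck {

    /** Returns true if both dimensions are non-negative and fit into a Java int. */
    static boolean hasValidCpDims(long rows, long cols) {
        return rows >= 0 && cols >= 0
            && rows <= Integer.MAX_VALUE
            && cols <= Integer.MAX_VALUE;
    }

    /**
     * Combined precondition for switching an operation to CP:
     * the memory estimate fits the budget AND the dimensions are valid.
     */
    static boolean canForceCp(long rows, long cols,
                              double memEstimateMB, double budgetMB) {
        return memEstimateMB <= budgetMB && hasValidCpDims(rows, cols);
    }

    public static void main(String[] args) {
        // Dimensions from the stack trace: 3 x 5129281161.
        // The memory check alone would pass, but the column count exceeds
        // Integer.MAX_VALUE (2147483647), so the combined check rejects it.
        System.out.println(canForceCp(3L, 5129281161L, 109, 4638)); // prints false

        // A matrix with int-sized dimensions under the same budget is accepted.
        System.out.println(canForceCp(3L, 1000000L, 109, 4638));    // prints true
    }
}
```

With such a guard, the ultra-sparse 3 x 5129281161 input would keep its Spark
operations instead of being compiled to CP, even though its memory estimate is
small.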