This is an automated email from the ASF dual-hosted git repository.
baunsgaard pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/systemds.git
The following commit(s) were added to refs/heads/main by this push:
new 15c30ea40d [SYSTEMDS-2834] Python I/O Benchmarking
15c30ea40d is described below
commit 15c30ea40d526b34cce929716a1a2363549c6c44
Author: Kyle Krueger <[email protected]>
AuthorDate: Sat Jun 17 20:10:53 2023 +0200
[SYSTEMDS-2834] Python I/O Benchmarking
This commit extends the performance benchmarks to include a python
benchmark for the transfer of data from the Python API into and out of
systemds. Results include:
double: read.dml; 40.781715454
double: load_native.py; 39.19094614699134
int: read.dml; 32.824596657
int: load_native.py; 36.457156577002024
string: read.dml; 34.440663763
string: load_native.py; 38.71029913998791
boolean: read.dml; 33.266684618
boolean: load_native.py; 36.68671202700352
double: load_numpy.py; 32.85507999898982
double: load_pandas.py; 512.6433556610136
float: load_numpy.py; 38.261559439997654
float: load_pandas.py; 546.0650390849914
long: load_numpy.py; 39.400702337006805
long: load_pandas.py; 536.5950958920002
int64: load_numpy.py; 32.98173662999761
int64: load_pandas.py; 487.0634801320266
int32: load_numpy.py; 32.48500068101566
int32: load_pandas.py; 489.97116349000135
uint8: load_numpy.py; 31.86706029099878
uint8: load_pandas.py; 496.9151880980062
string: load_pandas.py; 504.3096235789999
bool: load_numpy.py; 33.19832509398111
bool: load_pandas.py; 479.9256292580103
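The Python timings above come from scripts that pass a setup string and a statement string to timeit.timeit, with inputs supplied through globals. A minimal, self-contained sketch of that pattern follows. The payload (summing a list) and the names `benchmark` and `n` are placeholders for illustration and are not part of the commit.

```python
import timeit

# Setup runs once per timeit call and is NOT included in the measurement.
# Placeholder payload: build a list whose size comes from the global `n`.
setup = "\n".join(
    [
        "data = list(range(n))",
    ]
)

# The statement is the timed part; timeit repeats it `number` times.
run = "\n".join(
    [
        "total = sum(data)",
    ]
)


def benchmark(n, number):
    """Return the total time in seconds for `number` repetitions of `run`."""
    gvars = {"n": n}  # names made visible to both the setup and run strings
    return timeit.timeit(run, setup, globals=gvars, number=number)


if __name__ == "__main__":
    print(benchmark(n=100_000, number=5))
```

In the committed scripts the SystemDSContext is opened inside the timed statement, so each repetition presumably includes context startup as well as the data transfer.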
Pandas reading and writing underperforms and needs to be
refined, while numpy transfer is on par with normal reads.
Both cases show potential for improvement, especially
pandas.
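To put the pandas gap in numbers, the listed timings can be divided pairwise. The sketch below copies the load_numpy.py and load_pandas.py values from the results above (presumably seconds, since they are printed from timeit.timeit). The helper name `slowdown` is illustrative only.

```python
# Timings copied from the results in this commit message.
numpy_times = {
    "double": 32.85507999898982,
    "float": 38.261559439997654,
    "long": 39.400702337006805,
    "int64": 32.98173662999761,
    "int32": 32.48500068101566,
    "uint8": 31.86706029099878,
    "bool": 33.19832509398111,
}
pandas_times = {
    "double": 512.6433556610136,
    "float": 546.0650390849914,
    "long": 536.5950958920002,
    "int64": 487.0634801320266,
    "int32": 489.97116349000135,
    "uint8": 496.9151880980062,
    "string": 504.3096235789999,  # no numpy counterpart was reported
    "bool": 479.9256292580103,
}


def slowdown(slow, fast):
    """Ratio slow/fast for every dtype present in both mappings."""
    return {k: slow[k] / fast[k] for k in fast if k in slow}


ratios = slowdown(pandas_times, numpy_times)
for dtype, ratio in sorted(ratios.items()):
    print(f"{dtype}: pandas took {ratio:.1f}x the numpy time")
```

For these numbers every ratio falls between about 13x and 16x.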
Closes #1847
---
scripts/perftest/README.md | 27 +++++--
scripts/perftest/datagen/genClusteringData.sh | 6 +-
.../perftest/datagen/genDimensionReductionData.sh | 6 +-
.../{genDimensionReductionData.sh => genIOData.sh} | 33 +++++---
.../{runPCA.sh => python/io/load_native.py} | 51 ++++++++-----
scripts/perftest/python/io/load_numpy.py | 89 ++++++++++++++++++++++
scripts/perftest/python/io/load_pandas.py | 87 +++++++++++++++++++++
scripts/perftest/runAll.sh | 3 +
scripts/perftest/runAllDimensionReduction.sh | 6 +-
scripts/perftest/runAllIO.sh | 82 ++++++++++++++++++++
scripts/perftest/runAllMultinomial.sh | 1 -
scripts/perftest/{runPCA.sh => runIO.sh} | 39 ++++++----
scripts/perftest/runPCA.sh | 2 +-
scripts/perftest/{runPCA.sh => scripts/read.dml} | 27 +------
scripts/utils/generateData.dml | 3 +-
src/main/python/systemds/utils/converters.py | 5 ++
16 files changed, 382 insertions(+), 85 deletions(-)
diff --git a/scripts/perftest/README.md b/scripts/perftest/README.md
index 44939391ca..14ea405b3a 100755
--- a/scripts/perftest/README.md
+++ b/scripts/perftest/README.md
@@ -17,18 +17,35 @@ limitations under the License.
{% end comment %}
-->
-# Performance tests SystemDS
+# Performance Tests SystemDS
-To run all performance tests for SystemDS, simply download systemds, install the prerequisites and execute.
+To run all performance tests for SystemDS:
+ * install systemds,
+ * install the prerequisites,
+ * navigate to the perftest directory $`cd $SYSTEMDS_ROOT/scripts/perftest`
+ * generate the data,
+ * and execute.
There are a few prerequisites:
+## Install SystemDS
+
- First follow the install guide: <http://apache.github.io/systemds/site/install> and build the project.
+- Install the python package for python api benchmarks: <https://apache.github.io/systemds/api/python/getting_started/install.html>
+- Prepare to run SystemDS: <https://apache.github.io/systemds/site/run>
+
+## Install Additional Prerequisites
- Setup Intel MKL: <http://apache.github.io/systemds/site/run>
- Setup OpenBlas: <https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages>
- Install Perf stat: <https://linoxide.com/linux-how-to/install-perf-tool-centos-ubuntu/>
-## NOTE THE SCRIPT HAS TO BE RUN FROM THE PERFTEST FOLDER
+## Generate Test Data
+
+Using the scripts found in `$SYSTEMDS_ROOT/scripts/perftest/datagen`, generate the data for the tests you want to run. Note the sometimes optional and other times required parameters/args. Dataset size is likely the most important of these.
+
+## Run the Benchmarks
+
+**Reminder: The scripts should be run from the perftest folder.**
Examples:
@@ -36,7 +53,7 @@ Examples:
./runAll.sh
```
-Look inside the runAll script to see how to run individual tests.
+Or look inside the runAll script to see how to run individual tests.
-Time calculations in the bash scripts additionally subtract a number, e.g. ".4".
+Time calculations in the bash scripts may additionally subtract a number, e.g. ".4".
This is done to accommodate for time lost by shell script and JVM startup
overheads, to match the actual application runtime of SystemML.
diff --git a/scripts/perftest/datagen/genClusteringData.sh b/scripts/perftest/datagen/genClusteringData.sh
index 9fb1e9db45..35c49aaa6c 100755
--- a/scripts/perftest/datagen/genClusteringData.sh
+++ b/scripts/perftest/datagen/genClusteringData.sh
@@ -25,9 +25,9 @@ then
exit 1;
fi
-CMD=$1
-BASE=$2/clustering
-MAXMEM=$3
+CMD=${1:-systemds}
+BASE=${2:-"temp"}/clustering
+MAXMEM=${3:-80}
FORMAT="binary"
DENSE_SP=0.9
diff --git a/scripts/perftest/datagen/genDimensionReductionData.sh b/scripts/perftest/datagen/genDimensionReductionData.sh
index 1207a0dc41..2f6cc21b16 100755
--- a/scripts/perftest/datagen/genDimensionReductionData.sh
+++ b/scripts/perftest/datagen/genDimensionReductionData.sh
@@ -25,9 +25,9 @@ then
exit 1;
fi
-CMD=$1
-BASE=$2/dimensionreduction
-MAXMEM=$3
+CMD=${1:-systemds}
+BASE=${2:-"temp"}/dimensionreduction
+MAXMEM=${3:-80}
FORMAT="binary"
diff --git a/scripts/perftest/datagen/genDimensionReductionData.sh b/scripts/perftest/datagen/genIOData.sh
similarity index 57%
copy from scripts/perftest/datagen/genDimensionReductionData.sh
copy to scripts/perftest/datagen/genIOData.sh
index 1207a0dc41..46154f8636 100755
--- a/scripts/perftest/datagen/genDimensionReductionData.sh
+++ b/scripts/perftest/datagen/genIOData.sh
@@ -25,37 +25,48 @@ then
exit 1;
fi
-CMD=$1
-BASE=$2/dimensionreduction
-MAXMEM=$3
+CMD=${1:-systemds}
+DATADIR=${2:-"temp"}/io
+MAXMEM=${3:-1}
-FORMAT="binary"
+FORMAT="csv" # can be csv, mm, text, binary
-echo "-- Generating Dimension Reduction data." >> results/times.txt;
+echo "-- Generating IO data." >> results/times.txt;
+
+
+#generate XS scenarios (1MB)
+if [ $MAXMEM -ge 1 ]; then
+ ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X500_250_dense R=500 C=250 Fmt=$FORMAT &
+fi
+
+#generate XS scenarios (10MB)
+if [ $MAXMEM -ge 10 ]; then
+ ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X5k_250_dense R=5000 C=250 Fmt=$FORMAT &
+fi
#generate XS scenarios (80MB)
if [ $MAXMEM -ge 80 ]; then
- ${CMD} -f ../datagen/genRandData4PCA.dml --nvargs R=5000 C=2000 OUT=$BASE/pcaData5k_2k_dense FMT=$FORMAT &
+ ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X10k_1k_dense R=10000 C=1000 Fmt=$FORMAT &
fi
#generate S scenarios (800MB)
if [ $MAXMEM -ge 800 ]; then
- ${CMD} -f ../datagen/genRandData4PCA.dml --nvargs R=50000 C=2000 OUT=$BASE/pcaData50k_2k_dense FMT=$FORMAT &
+ ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X100k_1k_dense R=100000 C=1000 Fmt=$FORMAT &
fi
#generate M scenarios (8GB)
if [ $MAXMEM -ge 8000 ]; then
- ${CMD} -f ../datagen/genRandData4PCA.dml --nvargs R=500000 C=2000 OUT=$BASE/pcaData500k_2k_dense FMT=$FORMAT &
+ ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X1M_1k_dense R=1000000 C=1000 Fmt=$FORMAT &
fi
#generate L scenarios (80GB)
if [ $MAXMEM -ge 80000 ]; then
- ${CMD} -f ../datagen/genRandData4PCA.dml --nvargs R=5000000 C=2000 OUT=$BASE/pcaData5M_2k_dense FMT=$FORMAT
+ ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X10M_1k_dense R=10000000 C=1000 Fmt=$FORMAT &
fi
#generate XL scenarios (800GB)
if [ $MAXMEM -ge 800000 ]; then
- ${CMD} -f ${EXTRADOT}./datagen/genRandData4PCA.dml --nvargs R=50000000 C=2000 OUT=$BASE/pcaData50M_2k_dense FMT=$FORMAT
+ ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X100M_1k_dense R=100000000 C=1000 Fmt=$FORMAT &
fi
-wait
\ No newline at end of file
+wait
diff --git a/scripts/perftest/runPCA.sh b/scripts/perftest/python/io/load_native.py
old mode 100755
new mode 100644
similarity index 50%
copy from scripts/perftest/runPCA.sh
copy to scripts/perftest/python/io/load_native.py
index 66fd356005..aa1d89156b
--- a/scripts/perftest/runPCA.sh
+++ b/scripts/perftest/python/io/load_native.py
@@ -1,4 +1,3 @@
-#!/bin/bash
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
@@ -8,9 +7,9 @@
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
-#
+#
# http://www.apache.org/licenses/LICENSE-2.0
-#
+#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
@@ -19,25 +18,39 @@
# under the License.
#
#-------------------------------------------------------------
-set -e
-if [ "$(basename $PWD)" != "perftest" ];
-then
- echo "Please execute scripts from directory 'perftest'"
- exit 1;
-fi
+import argparse
+import timeit
+
+
+setup = "\n".join(
+ [
+ "from systemds.context import SystemDSContext",
+ "from systemds.script_building.script import DMLScript",
+ ]
+)
+
-CMD=$3
-BASE=$2
+run = "\n".join(
+ [
+ "with SystemDSContext(logging_level=10, py4j_logging_level=50) as ctx:",
+ " node = ctx.read(src)",
+ " script = DMLScript(ctx)",
+ " script.build_code(node)",
+ " script.execute()",
+ ]
+)
-tstart=$(date +%s.%N)
-# ${CMD} -f ../algorithms/PCA.dml \
-${CMD} -f ./scripts/PCA.dml \
- --config conf/SystemDS-config.xml \
- --stats \
- --nvargs INPUT=$1 SCALE=1 PROJDATA=1 OUTPUT=${BASE}/output
+def main(args):
+ gvars = {"src": args.src}
+ print(timeit.timeit(run, setup, globals=gvars, number=args.number))
-ttrain=$(echo "$(date +%s.%N) - $tstart - .4" | bc)
-echo "PCA on "$1": "$ttrain >> results/times.txt
+if __name__ == "__main__":
+ description = "Benchmarks time spent loading data into systemds"
+ parser = argparse.ArgumentParser(description=description)
+ parser.add_argument("src")
+ parser.add_argument("number", type=int, help="number of times to load the data")
+ args = parser.parse_args()
+ main(args)
diff --git a/scripts/perftest/python/io/load_numpy.py b/scripts/perftest/python/io/load_numpy.py
new file mode 100644
index 0000000000..8bd489b064
--- /dev/null
+++ b/scripts/perftest/python/io/load_numpy.py
@@ -0,0 +1,89 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+
+import argparse
+import timeit
+
+setup = "\n".join(
+ [
+ "from systemds.context import SystemDSContext",
+ "from systemds.script_building.script import DMLScript",
+ "import numpy as np",
+ "array = np.loadtxt(src, delimiter=',')",
+ "if dtype is not None:",
+ " array = array.astype(dtype)",
+ ]
+)
+
+
+run = "\n".join(
+ [
+ "with SystemDSContext(logging_level=10, py4j_logging_level=50) as ctx:",
+ " matrix_from_np = ctx.from_numpy(array)",
+ " script = DMLScript(ctx)",
+ " script.add_input_from_python('test', matrix_from_np)",
+ " script.execute()",
+ ]
+)
+
+
+dtype_choices = [
+ "double",
+ "float",
+ "long",
+ "int8",
+ "int16",
+ "int32",
+ "int64",
+ "uint8",
+ "uint16",
+ "uint32",
+ "uint64",
+ "float32",
+ "float64",
+ "string",
+ "bool",
+]
+
+
+def main(args):
+ gvars = {"src": args.src, "dtype": args.dtype}
+ print(timeit.timeit(run, setup, globals=gvars, number=args.number))
+
+
+if __name__ == "__main__":
+ description = "Benchmarks time spent loading data into systemds"
+ parser = argparse.ArgumentParser(description=description)
+ parser.add_argument("src")
+ parser.add_argument("number", type=int, help="number of times to load the data")
+ help_force_dtype = (
+ "optionally cast all columns to one of the dtype choices in numpy"
+ )
+ parser.add_argument(
+ "--dtype",
+ choices=dtype_choices,
+ required=False,
+ default=None,
+ help=help_force_dtype,
+ )
+ args = parser.parse_args()
+ main(args)
diff --git a/scripts/perftest/python/io/load_pandas.py b/scripts/perftest/python/io/load_pandas.py
new file mode 100644
index 0000000000..30714ca907
--- /dev/null
+++ b/scripts/perftest/python/io/load_pandas.py
@@ -0,0 +1,87 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+import argparse
+import timeit
+
+setup = "\n".join(
+ [
+ "from systemds.context import SystemDSContext",
+ "from systemds.script_building.script import DMLScript",
+ "import pandas as pd",
+ "df = pd.read_csv(src, header=None)",
+ "if dtype is not None:",
+ " df = df.astype(dtype)",
+ ]
+)
+
+
+run = "\n".join(
+ [
+ "with SystemDSContext(logging_level=10, py4j_logging_level=50) as ctx:",
+ " frame_from_pandas = ctx.from_pandas(df)",
+ " script = DMLScript(ctx)",
+ " script.add_input_from_python('test', frame_from_pandas)",
+ " script.execute()",
+ ]
+)
+
+dtype_choices = [
+ "double",
+ "float",
+ "long",
+ "int8",
+ "int16",
+ "int32",
+ "int64",
+ "uint8",
+ "uint16",
+ "uint32",
+ "uint64",
+ "float32",
+ "float64",
+ "string",
+ "bool",
+]
+
+
+def main(args):
+ gvars = {"src": args.src, "dtype": args.dtype}
+ print(timeit.timeit(run, setup, globals=gvars, number=args.number))
+
+
+if __name__ == "__main__":
+ description = "Benchmarks time spent loading data into systemds"
+ parser = argparse.ArgumentParser(description=description)
+ parser.add_argument("src")
+ parser.add_argument("number", type=int, help="number of times to load the data")
+ help_force_dtype = (
+ "optionally cast all columns to one of the dtype choices in pandas"
+ )
+ parser.add_argument(
+ "--dtype",
+ choices=dtype_choices,
+ required=False,
+ default=None,
+ help=help_force_dtype,
+ )
+ args = parser.parse_args()
+ main(args)
diff --git a/scripts/perftest/runAll.sh b/scripts/perftest/runAll.sh
index db315597bf..9b20606c1d 100755
--- a/scripts/perftest/runAll.sh
+++ b/scripts/perftest/runAll.sh
@@ -127,6 +127,9 @@ echo -e "\n\n" >> results/times.txt
./runAllDimensionReduction.sh ${CMD} ${TEMPFOLDER} ${MAXMEM}
./runAllALS.sh ${CMD} ${TEMPFOLDER} ${MAXMEM}
+### IO Benchmarks:
+./runAllIO.sh ${CMD} ${TEMPFOLDER} ${MAXMEM}
+
# TODO The following benchmarks have yet to be written. The decision tree algorithms additionally need to be fixed.
# add stepwise Linear
# add stepwise GLM
diff --git a/scripts/perftest/runAllDimensionReduction.sh b/scripts/perftest/runAllDimensionReduction.sh
index e154926689..03955fc160 100755
--- a/scripts/perftest/runAllDimensionReduction.sh
+++ b/scripts/perftest/runAllDimensionReduction.sh
@@ -25,9 +25,9 @@ then
exit 1;
fi
-COMMAND=$1
-BASE=$2/dimensionreduction
-MAXMEM=$3
+CMD=${1:-systemds}
+BASE=${2:-"temp"}/dimensionreduction
+MAXMEM=${3:-80}
FILENAME=$0
err_report() {
diff --git a/scripts/perftest/runAllIO.sh b/scripts/perftest/runAllIO.sh
new file mode 100755
index 0000000000..8e321a7d4e
--- /dev/null
+++ b/scripts/perftest/runAllIO.sh
@@ -0,0 +1,82 @@
+#!/bin/bash
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+if [ "$(basename $PWD)" != "perftest" ];
+then
+ echo "Please execute scripts from directory 'perftest'"
+ exit 1;
+fi
+
+CMD=${1:-"systemds"}
+DATADIR=${2:-"temp"}/io
+MAXMEM=${3:-1}
+REPEATS=${4:-1}
+
+DATA=()
+if [ $MAXMEM -ge 1 ]; then DATA+=("500_250_dense"); fi
+if [ $MAXMEM -ge 10 ]; then DATA+=("5k_250_dense"); fi
+if [ $MAXMEM -ge 80 ]; then DATA+=("10k_1k_dense"); fi
+if [ $MAXMEM -ge 800 ]; then DATA+=("100k_1k_dense"); fi
+if [ $MAXMEM -ge 8000 ]; then DATA+=("1M_1k_dense"); fi
+if [ $MAXMEM -ge 80000 ]; then DATA+=("10M_1k_dense"); fi
+if [ $MAXMEM -ge 800000 ]; then DATA+=("100M_1k_dense"); fi
+
+echo "RUN IO Benchmarks: " $(date) >> results/times.txt;
+
+execute_python_script () {
+ script=$1
+ input=$2
+ repeats=$3
+ DTYPE=$4
+ printf "%-16s " "${script}; " >> results/times.txt;
+ if [ -z "$DTYPE" ]; then
+ TIME_IO=$(python ./python/io/${script} ${input} ${repeats});
+ else
+ TIME_IO=$(python ./python/io/${script} ${input} ${repeats} --dtype ${DTYPE});
+ fi
+ printf "%s\n" "$TIME_IO" >> results/times.txt
+}
+
+for d in ${DATA[@]}
+do
+ echo "-- Running IO benchmarks on "$d >> results/times.txt;
+ DATAFILE="$DATADIR/X$d"
+ F="runIO.sh"
+ for vtype in "double" "int" "string" "boolean"
+ do
+ . ./$F $CMD $DATAFILE $REPEATS $vtype
+ cp "${DATAFILE}.mtd" "${DATAFILE}.mtd.backup"
+ sed -i "s/\"value_type\":.*$/\"value_type\": \"${vtype}\",/" "${DATAFILE}.mtd"
+ printf "%-10s " "${vtype}: " >> results/times.txt;
+ execute_python_script "load_native.py" $DATAFILE $REPEATS
+ rm "${DATAFILE}.mtd"
+ mv "${DATAFILE}.mtd.backup" "${DATAFILE}.mtd"
+ done
+ for vtype in "double" "float" "long" "int64" "int32" "uint8" "string" "bool"
+ do
+ printf "%-10s " "${vtype}: " >> results/times.txt;
+ execute_python_script "load_numpy.py" $DATAFILE $REPEATS $vtype
+ printf "%-10s " "${vtype}: " >> results/times.txt;
+ execute_python_script "load_pandas.py" $DATAFILE $REPEATS $vtype
+ done
+done
+
+echo -e "\n\n" >> results/times.txt
diff --git a/scripts/perftest/runAllMultinomial.sh b/scripts/perftest/runAllMultinomial.sh
index 1078c20581..2b878d24ae 100755
--- a/scripts/perftest/runAllMultinomial.sh
+++ b/scripts/perftest/runAllMultinomial.sh
@@ -31,7 +31,6 @@ MAXMEM=$3
if [ "$TEMPFOLDER" == "" ]; then TEMPFOLDER=temp ; fi
BASE=${TEMPFOLDER}/multinomial
-BASE0=${TEMPFOLDER}/binomial
MAXITR=20
FILENAME=$0
diff --git a/scripts/perftest/runPCA.sh b/scripts/perftest/runIO.sh
similarity index 60%
copy from scripts/perftest/runPCA.sh
copy to scripts/perftest/runIO.sh
index 66fd356005..15df74664a 100755
--- a/scripts/perftest/runPCA.sh
+++ b/scripts/perftest/runIO.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
@@ -8,9 +8,9 @@
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
-#
+#
# http://www.apache.org/licenses/LICENSE-2.0
-#
+#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
@@ -19,7 +19,6 @@
# under the License.
#
#-------------------------------------------------------------
-set -e
if [ "$(basename $PWD)" != "perftest" ];
then
@@ -27,17 +26,29 @@ then
exit 1;
fi
-CMD=$3
-BASE=$2
-tstart=$(date +%s.%N)
+CMD=$1
+DATA=$2
+REPEAT=${3:-1}
+VTYPE=${4:-"double"}
+DTYPE=${5:-"matrix"}
-# ${CMD} -f ../algorithms/PCA.dml \
-${CMD} -f ./scripts/PCA.dml \
- --config conf/SystemDS-config.xml \
- --stats \
- --nvargs INPUT=$1 SCALE=1 PROJDATA=1 OUTPUT=${BASE}/output
+cp "${DATA}.mtd" "${DATA}.mtd.backup"
+sed -i "s/\"data_type\":.*$/\"data_type\": \"${DTYPE}\",/" "${DATA}.mtd"
+sed -i "s/\"value_type\":.*$/\"value_type\": \"${VTYPE}\",/" "${DATA}.mtd"
+tstart=$(date +%s.%N)
+printf "%-10s " "$VTYPE: " >> results/times.txt;
+printf "%-16s " "read.dml; " >> results/times.txt;
+for n in $(seq $REPEAT)
+do
+ ${CMD} -f ./scripts/read.dml \
+ --config conf/SystemDS-config.xml \
+ --stats \
+ --nvargs INPUT="$DATA"
+done
-ttrain=$(echo "$(date +%s.%N) - $tstart - .4" | bc)
-echo "PCA on "$1": "$ttrain >> results/times.txt
+duration=$(echo "$(date +%s.%N) - $tstart" | bc)
+printf "%s\n" "$duration" >> results/times.txt
+rm "${DATA}.mtd"
+mv "${DATA}.mtd.backup" "${DATA}.mtd"
diff --git a/scripts/perftest/runPCA.sh b/scripts/perftest/runPCA.sh
index 66fd356005..fdb56d4a8f 100755
--- a/scripts/perftest/runPCA.sh
+++ b/scripts/perftest/runPCA.sh
@@ -27,7 +27,7 @@ then
exit 1;
fi
-CMD=$3
+CMD=${3:-systemds}
BASE=$2
tstart=$(date +%s.%N)
diff --git a/scripts/perftest/runPCA.sh b/scripts/perftest/scripts/read.dml
old mode 100755
new mode 100644
similarity index 66%
copy from scripts/perftest/runPCA.sh
copy to scripts/perftest/scripts/read.dml
index 66fd356005..e391926ac2
--- a/scripts/perftest/runPCA.sh
+++ b/scripts/perftest/scripts/read.dml
@@ -1,4 +1,3 @@
-#!/bin/bash
#-------------------------------------------------------------
#
# Licensed to the Apache Software Foundation (ASF) under one
@@ -8,9 +7,9 @@
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
-#
+#
# http://www.apache.org/licenses/LICENSE-2.0
-#
+#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
@@ -19,25 +18,5 @@
# under the License.
#
#-------------------------------------------------------------
-set -e
-
-if [ "$(basename $PWD)" != "perftest" ];
-then
- echo "Please execute scripts from directory 'perftest'"
- exit 1;
-fi
-
-CMD=$3
-BASE=$2
-
-tstart=$(date +%s.%N)
-
-# ${CMD} -f ../algorithms/PCA.dml \
-${CMD} -f ./scripts/PCA.dml \
- --config conf/SystemDS-config.xml \
- --stats \
- --nvargs INPUT=$1 SCALE=1 PROJDATA=1 OUTPUT=${BASE}/output
-
-ttrain=$(echo "$(date +%s.%N) - $tstart - .4" | bc)
-echo "PCA on "$1": "$ttrain >> results/times.txt
+data = read($INPUT);
diff --git a/scripts/utils/generateData.dml b/scripts/utils/generateData.dml
index fd13934901..11a6e700b8 100644
--- a/scripts/utils/generateData.dml
+++ b/scripts/utils/generateData.dml
@@ -45,6 +45,7 @@ minVal = ifdef($Min, 0)
maxVal = ifdef($Max, 10)
pdFunc = ifdef($Pdf, "uniform")
pathUse = ifdef($Path, "/user/bigr/randomData")
+format = ifdef($Fmt, "csv")
A = rand(rows=numRows, cols=numCols, sparsity=sparsityParam, min=minVal, max=maxVal, pdf="uniform");
-write(A, pathUse, format="csv");
+write(A, pathUse, format=format);
diff --git a/src/main/python/systemds/utils/converters.py b/src/main/python/systemds/utils/converters.py
index 3d0c8cf146..136e3470ca 100644
--- a/src/main/python/systemds/utils/converters.py
+++ b/src/main/python/systemds/utils/converters.py
@@ -135,6 +135,11 @@ def pandas_to_frame_block(sds, pd_df: pd.DataFrame):
def frame_block_to_pandas(sds, fb: JavaObject):
+ """Converts a FrameBlock object in the JVM to a pandas dataframe.
+
+ :param sds: The current systemds context.
+ :param fb: A pointer to the JVM's FrameBlock object.
+ """
num_rows = fb.getNumRows()
num_cols = fb.getNumColumns()