[systemds] branch main updated: [SYSTEMDS-2834] Python I/O Benchmarking

baunsgaard Fri, 14 Jul 2023 06:38:58 -0700

This is an automated email from the ASF dual-hosted git repository.

baunsgaard pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/systemds.git



The following commit(s) were added to refs/heads/main by this push:
     new 15c30ea40d [SYSTEMDS-2834] Python I/O Benchmarking
15c30ea40d is described below

commit 15c30ea40d526b34cce929716a1a2363549c6c44
Author: Kyle Krueger <[email protected]>
AuthorDate: Sat Jun 17 20:10:53 2023 +0200

    [SYSTEMDS-2834] Python I/O Benchmarking
    
    This commit extends the performance benchmarks to include a python
    benchmark for the transfer of data from the Python API into and out of
    systemds. Results include:
    
    double:    read.dml;        40.781715454
    double:    load_native.py;  39.19094614699134
    int:       read.dml;        32.824596657
    int:       load_native.py;  36.457156577002024
    string:    read.dml;        34.440663763
    string:    load_native.py;  38.71029913998791
    boolean:   read.dml;        33.266684618
    boolean:   load_native.py;  36.68671202700352
    double:    load_numpy.py;   32.85507999898982
    double:    load_pandas.py;  512.6433556610136
    float:     load_numpy.py;   38.261559439997654
    float:     load_pandas.py;  546.0650390849914
    long:      load_numpy.py;   39.400702337006805
    long:      load_pandas.py;  536.5950958920002
    int64:     load_numpy.py;   32.98173662999761
    int64:     load_pandas.py;  487.0634801320266
    int32:     load_numpy.py;   32.48500068101566
    int32:     load_pandas.py;  489.97116349000135
    uint8:     load_numpy.py;   31.86706029099878
    uint8:     load_pandas.py;  496.9151880980062
    string:    load_pandas.py;  504.3096235789999
    bool:      load_numpy.py;   33.19832509398111
    bool:      load_pandas.py;  479.9256292580103
    
    Pandas reading and writing are underperforming and need to be
    refined, while numpy transfer is on par with normal reads.
    Both cases indicate potential for improvement, especially
    pandas.
    
    Closes #1847
---
 scripts/perftest/README.md                         | 27 +++++--
 scripts/perftest/datagen/genClusteringData.sh      |  6 +-
 .../perftest/datagen/genDimensionReductionData.sh  |  6 +-
 .../{genDimensionReductionData.sh => genIOData.sh} | 33 +++++---
 .../{runPCA.sh => python/io/load_native.py}        | 51 ++++++++-----
 scripts/perftest/python/io/load_numpy.py           | 89 ++++++++++++++++++++++
 scripts/perftest/python/io/load_pandas.py          | 87 +++++++++++++++++++++
 scripts/perftest/runAll.sh                         |  3 +
 scripts/perftest/runAllDimensionReduction.sh       |  6 +-
 scripts/perftest/runAllIO.sh                       | 82 ++++++++++++++++++++
 scripts/perftest/runAllMultinomial.sh              |  1 -
 scripts/perftest/{runPCA.sh => runIO.sh}           | 39 ++++++----
 scripts/perftest/runPCA.sh                         |  2 +-
 scripts/perftest/{runPCA.sh => scripts/read.dml}   | 27 +------
 scripts/utils/generateData.dml                     |  3 +-
 src/main/python/systemds/utils/converters.py       |  5 ++
 16 files changed, 382 insertions(+), 85 deletions(-)
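
The timings listed in the commit message can be compared directly. A minimal sketch, assuming the `dtype: script; seconds` layout shown there (the helpers `parse_results` and `slowdown` are hypothetical and not part of the commit), computes how much slower `load_pandas.py` is than `load_numpy.py` for a few of the reported dtypes:

```python
# Hypothetical helpers, not part of the commit: compare the pandas and numpy
# timings quoted in the commit message.
RESULTS = """\
double:    load_numpy.py;   32.85507999898982
double:    load_pandas.py;  512.6433556610136
int32:     load_numpy.py;   32.48500068101566
int32:     load_pandas.py;  489.97116349000135
bool:      load_numpy.py;   33.19832509398111
bool:      load_pandas.py;  479.9256292580103
"""


def parse_results(text):
    """Return {(dtype, script): seconds} from 'dtype: script; seconds' lines."""
    timings = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        dtype, rest = line.split(":", 1)
        script, seconds = rest.split(";", 1)
        timings[(dtype.strip(), script.strip())] = float(seconds)
    return timings


def slowdown(timings, dtype):
    """Ratio of pandas transfer time to numpy transfer time for one dtype."""
    return timings[(dtype, "load_pandas.py")] / timings[(dtype, "load_numpy.py")]


timings = parse_results(RESULTS)
for dtype in ("double", "int32", "bool"):
    print(f"{dtype}: pandas is {slowdown(timings, dtype):.1f}x slower than numpy")
```

For these three dtypes the pandas path takes roughly 15 times as long as the numpy path.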

diff --git a/scripts/perftest/README.md b/scripts/perftest/README.md
index 44939391ca..14ea405b3a 100755
--- a/scripts/perftest/README.md
+++ b/scripts/perftest/README.md
@@ -17,18 +17,35 @@ limitations under the License.
 {% end comment %}
 -->
 
-# Performance tests SystemDS
+# Performance Tests SystemDS
 
-To run all performance tests for SystemDS, simply download systemds, install the prerequisites and execute.
+To run all performance tests for SystemDS:
+ * install systemds,
+ * install the prerequisites,
+ * navigate to the perftest directory $`cd $SYSTEMDS_ROOT/scripts/perftest` 
+ * generate the data,
+ * and execute.
 
 There are a few prerequisites:
 
+## Install SystemDS
+
 - First follow the install guide: <http://apache.github.io/systemds/site/install> and build the project.
+- Install the python package for python api benchmarks: <https://apache.github.io/systemds/api/python/getting_started/install.html>
+- Prepare to run SystemDS: <https://apache.github.io/systemds/site/run>
+
+## Install Additional Prerequisites
 - Setup Intel MKL: <http://apache.github.io/systemds/site/run>
 - Setup OpenBlas: <https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages>
 - Install Perf stat: <https://linoxide.com/linux-how-to/install-perf-tool-centos-ubuntu/>
 
-## NOTE THE SCRIPT HAS TO BE RUN FROM THE PERFTEST FOLDER
+## Generate Test Data
+
+Using the scripts found in `$SYSTEMDS_ROOT/scripts/perftest/datagen`, generate the data for the tests you want to run. Note the sometimes optional and other times required parameters/args. Dataset size is likely the most important of these.
+
+## Run the Benchmarks
+
+**Reminder: The scripts should be run from the perftest folder.**
 
 Examples:
 
@@ -36,7 +53,7 @@ Examples:
 ./runAll.sh
 ```
 
-Look inside the runAll script to see how to run individual tests.
+Or look inside the runAll script to see how to run individual tests.
 
-Time calculations in the bash scripts additionally subtract a number, e.g. ".4".
+Time calculations in the bash scripts may additionally subtract a number, e.g. ".4".
 This is done to accommodate for time lost by shell script and JVM startup overheads, to match the actual application runtime of SystemML.
diff --git a/scripts/perftest/datagen/genClusteringData.sh b/scripts/perftest/datagen/genClusteringData.sh
index 9fb1e9db45..35c49aaa6c 100755
--- a/scripts/perftest/datagen/genClusteringData.sh
+++ b/scripts/perftest/datagen/genClusteringData.sh
@@ -25,9 +25,9 @@ then
   exit 1;
 fi
 
-CMD=$1
-BASE=$2/clustering
-MAXMEM=$3
+CMD=${1:-systemds}
+BASE=${2:-"temp"}/clustering
+MAXMEM=${3:-80}
 
 FORMAT="binary" 
 DENSE_SP=0.9
diff --git a/scripts/perftest/datagen/genDimensionReductionData.sh b/scripts/perftest/datagen/genDimensionReductionData.sh
index 1207a0dc41..2f6cc21b16 100755
--- a/scripts/perftest/datagen/genDimensionReductionData.sh
+++ b/scripts/perftest/datagen/genDimensionReductionData.sh
@@ -25,9 +25,9 @@ then
   exit 1;
 fi
 
-CMD=$1
-BASE=$2/dimensionreduction
-MAXMEM=$3
+CMD=${1:-systemds}
+BASE=${2:-"temp"}/dimensionreduction
+MAXMEM=${3:-80}
 
 FORMAT="binary"
 
diff --git a/scripts/perftest/datagen/genDimensionReductionData.sh b/scripts/perftest/datagen/genIOData.sh
similarity index 57%
copy from scripts/perftest/datagen/genDimensionReductionData.sh
copy to scripts/perftest/datagen/genIOData.sh
index 1207a0dc41..46154f8636 100755
--- a/scripts/perftest/datagen/genDimensionReductionData.sh
+++ b/scripts/perftest/datagen/genIOData.sh
@@ -25,37 +25,48 @@ then
   exit 1;
 fi
 
-CMD=$1
-BASE=$2/dimensionreduction
-MAXMEM=$3
+CMD=${1:-systemds}
+DATADIR=${2:-"temp"}/io
+MAXMEM=${3:-1}
 
-FORMAT="binary"
+FORMAT="csv" # can be csv, mm, text, binary
 
-echo "-- Generating Dimension Reduction data." >> results/times.txt;
+echo "-- Generating IO data." >> results/times.txt;
+
+
+#generate XS scenarios (10MB)
+if [ $MAXMEM -ge 1 ]; then
+  ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X500_250_dense R=500 C=250 Fmt=$FORMAT &
+fi
+
+#generate XS scenarios (10MB)
+if [ $MAXMEM -ge 10 ]; then
+  ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X5k_250_dense R=5000 C=250 Fmt=$FORMAT &
+fi
 
 #generate XS scenarios (80MB)
 if [ $MAXMEM -ge 80 ]; then
-  ${CMD} -f ../datagen/genRandData4PCA.dml --nvargs R=5000 C=2000 OUT=$BASE/pcaData5k_2k_dense FMT=$FORMAT &
+  ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X10k_1k_dense R=10000 C=1000 Fmt=$FORMAT &
 fi
 
 #generate S scenarios (800MB)
 if [ $MAXMEM -ge 800 ]; then
-  ${CMD} -f ../datagen/genRandData4PCA.dml --nvargs R=50000 C=2000 OUT=$BASE/pcaData50k_2k_dense FMT=$FORMAT &
+  ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X100k_1k_dense R=100000 C=1000 Fmt=$FORMAT &
 fi
 
 #generate M scenarios (8GB)
 if [ $MAXMEM -ge 8000 ]; then
-  ${CMD} -f ../datagen/genRandData4PCA.dml --nvargs R=500000 C=2000 OUT=$BASE/pcaData500k_2k_dense FMT=$FORMAT &
+  ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X1M_1k_dense R=1000000 C=1000 Fmt=$FORMAT &
 fi
 
 #generate L scenarios (80GB)
 if [ $MAXMEM -ge 80000 ]; then
-  ${CMD} -f ../datagen/genRandData4PCA.dml --nvargs R=5000000 C=2000 OUT=$BASE/pcaData5M_2k_dense FMT=$FORMAT
+  ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X10M_1k_dense R=10000000 C=1000 Fmt=$FORMAT &
 fi
 
 #generate XL scenarios (800GB)
 if [ $MAXMEM -ge 800000 ]; then
-  ${CMD} -f ${EXTRADOT}./datagen/genRandData4PCA.dml --nvargs R=50000000 C=2000 OUT=$BASE/pcaData50M_2k_dense FMT=$FORMAT
+  ${CMD} -f ../utils/generateData.dml --nvargs Path=${DATADIR}/X100M_1k_dense R=100000000 C=1000 Fmt=$FORMAT &
 fi
 
-wait
\ No newline at end of file
+wait
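
The scenario labels in `genIOData.sh` can be checked with simple arithmetic. A minimal sketch, assuming the labels refer to dense matrices held as 8-byte doubles (that assumption, and the helper `dense_size_mb`, are not stated in the commit); the CSV files written to disk will generally differ in size:

```python
# Assumption (not stated in the commit): the size labels refer to dense
# matrices stored as 8-byte doubles.
BYTES_PER_CELL = 8

# (rows, cols) taken from the generateData.dml calls in genIOData.sh.
SCENARIOS = {
    "X500_250_dense": (500, 250),
    "X5k_250_dense": (5_000, 250),
    "X10k_1k_dense": (10_000, 1_000),
    "X100k_1k_dense": (100_000, 1_000),
    "X1M_1k_dense": (1_000_000, 1_000),
}


def dense_size_mb(rows, cols, bytes_per_cell=BYTES_PER_CELL):
    """Size of a dense rows x cols matrix in megabytes (10**6 bytes)."""
    return rows * cols * bytes_per_cell / 1e6


for name, (rows, cols) in SCENARIOS.items():
    print(f"{name}: {dense_size_mb(rows, cols):g} MB")
```

Under this assumption the 500 x 250 scenario comes to about 1 MB, which matches its `MAXMEM` threshold of 1 rather than the "(10MB)" label in its comment.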
diff --git a/scripts/perftest/runPCA.sh b/scripts/perftest/python/io/load_native.py
old mode 100755
new mode 100644
similarity index 50%
copy from scripts/perftest/runPCA.sh
copy to scripts/perftest/python/io/load_native.py
index 66fd356005..aa1d89156b
--- a/scripts/perftest/runPCA.sh
+++ b/scripts/perftest/python/io/load_native.py
@@ -1,4 +1,3 @@
-#!/bin/bash
 #-------------------------------------------------------------
 #
 # Licensed to the Apache Software Foundation (ASF) under one
@@ -8,9 +7,9 @@
 # to you under the Apache License, Version 2.0 (the
 # "License"); you may not use this file except in compliance
 # with the License.  You may obtain a copy of the License at
-# 
+#
 #   http://www.apache.org/licenses/LICENSE-2.0
-# 
+#
 # Unless required by applicable law or agreed to in writing,
 # software distributed under the License is distributed on an
 # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
@@ -19,25 +18,39 @@
 # under the License.
 #
 #-------------------------------------------------------------
-set -e
 
-if [ "$(basename $PWD)" != "perftest" ];
-then
-  echo "Please execute scripts from directory 'perftest'"
-  exit 1;
-fi
+import argparse
+import timeit
+
+
+setup = "\n".join(
+    [
+        "from systemds.context import SystemDSContext",
+        "from systemds.script_building.script import DMLScript",
+    ]
+)
+
 
-CMD=$3
-BASE=$2
+run = "\n".join(
+    [
+        "with SystemDSContext(logging_level=10, py4j_logging_level=50) as ctx:",
+        "    node = ctx.read(src)",
+        "    script = DMLScript(ctx)",
+        "    script.build_code(node)",
+        "    script.execute()",
+    ]
+)
 
-tstart=$(date +%s.%N)
 
-# ${CMD} -f ../algorithms/PCA.dml \
-${CMD} -f ./scripts/PCA.dml \
-  --config conf/SystemDS-config.xml \
-  --stats \
-  --nvargs INPUT=$1 SCALE=1 PROJDATA=1 OUTPUT=${BASE}/output
+def main(args):
+    gvars = {"src": args.src}
+    print(timeit.timeit(run, setup, globals=gvars, number=args.number))
 
-ttrain=$(echo "$(date +%s.%N) - $tstart - .4" | bc)
-echo "PCA on "$1": "$ttrain >> results/times.txt
 
+if __name__ == "__main__":
+    description = "Benchmarks time spent loading data into systemds"
+    parser = argparse.ArgumentParser(description=description)
+    parser.add_argument("src")
+    parser.add_argument("number", type=int, help="number of times to load the data")
+    args = parser.parse_args()
+    main(args)
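
The benchmark scripts in this commit all follow the same `timeit` pattern: a `setup` string that runs once and is not timed, a `run` string that is timed, and a `globals` dictionary that passes values into both. A minimal sketch of that pattern without SystemDS (the workload and the name `n` are stand-ins chosen for illustration):

```python
import timeit

# Runs once, untimed: stands in for imports and data loading.
setup = "\n".join(
    [
        "data = list(range(n))",
    ]
)

# Runs `number` times, timed: stands in for the transfer being measured.
run = "\n".join(
    [
        "total = sum(data)",
    ]
)

number = 100
# `globals` supplies names (here the hypothetical `n`) to both strings.
elapsed = timeit.timeit(run, setup, globals={"n": 10_000}, number=number)
print(f"total: {elapsed:.6f}s, per run: {elapsed / number:.6f}s")
```

`timeit.timeit` returns the total time for all `number` executions, not a per-run average. In the scripts of this commit the `with SystemDSContext(...)` block sits inside `run`, so context startup is part of every timed iteration.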
diff --git a/scripts/perftest/python/io/load_numpy.py b/scripts/perftest/python/io/load_numpy.py
new file mode 100644
index 0000000000..8bd489b064
--- /dev/null
+++ b/scripts/perftest/python/io/load_numpy.py
@@ -0,0 +1,89 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+
+import argparse
+import timeit
+
+setup = "\n".join(
+    [
+        "from systemds.context import SystemDSContext",
+        "from systemds.script_building.script import DMLScript",
+        "import numpy as np",
+        "array = np.loadtxt(src, delimiter=',')",
+        "if dtype is not None:",
+        "    array = array.astype(dtype)",
+    ]
+)
+
+
+run = "\n".join(
+    [
+        "with SystemDSContext(logging_level=10, py4j_logging_level=50) as ctx:",
+        "    matrix_from_np = ctx.from_numpy(array)",
+        "    script = DMLScript(ctx)",
+        "    script.add_input_from_python('test', matrix_from_np)",
+        "    script.execute()",
+    ]
+)
+
+
+dtype_choices = [
+    "double",
+    "float",
+    "long",
+    "int8",
+    "int16",
+    "int32",
+    "int64",
+    "uint8",
+    "uint16",
+    "uint32",
+    "uint64",
+    "float32",
+    "float64",
+    "string",
+    "bool",
+]
+
+
+def main(args):
+    gvars = {"src": args.src, "dtype": args.dtype}
+    print(timeit.timeit(run, setup, globals=gvars, number=args.number))
+
+
+if __name__ == "__main__":
+    description = "Benchmarks time spent loading data into systemds"
+    parser = argparse.ArgumentParser(description=description)
+    parser.add_argument("src")
+    parser.add_argument("number", type=int, help="number of times to load the data")
+    help_force_dtype = (
+        "optionally cast all columns to one of the dtype choices in numpy"
+    )
+    parser.add_argument(
+        "--dtype",
+        choices=dtype_choices,
+        required=False,
+        default=None,
+        help=help_force_dtype,
+    )
+    args = parser.parse_args()
+    main(args)
diff --git a/scripts/perftest/python/io/load_pandas.py b/scripts/perftest/python/io/load_pandas.py
new file mode 100644
index 0000000000..30714ca907
--- /dev/null
+++ b/scripts/perftest/python/io/load_pandas.py
@@ -0,0 +1,87 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+import argparse
+import timeit
+
+setup = "\n".join(
+    [
+        "from systemds.context import SystemDSContext",
+        "from systemds.script_building.script import DMLScript",
+        "import pandas as pd",
+        "df = pd.read_csv(src, header=None)",
+        "if dtype is not None:",
+        "    df = df.astype(dtype)",
+    ]
+)
+
+
+run = "\n".join(
+    [
+        "with SystemDSContext(logging_level=10, py4j_logging_level=50) as ctx:",
+        "    frame_from_pandas = ctx.from_pandas(df)",
+        "    script = DMLScript(ctx)",
+        "    script.add_input_from_python('test', frame_from_pandas)",
+        "    script.execute()",
+    ]
+)
+
+dtype_choices = [
+    "double",
+    "float",
+    "long",
+    "int8",
+    "int16",
+    "int32",
+    "int64",
+    "uint8",
+    "uint16",
+    "uint32",
+    "uint64",
+    "float32",
+    "float64",
+    "string",
+    "bool",
+]
+
+
+def main(args):
+    gvars = {"src": args.src, "dtype": args.dtype}
+    print(timeit.timeit(run, setup, globals=gvars, number=args.number))
+
+
+if __name__ == "__main__":
+    description = "Benchmarks time spent loading data into systemds"
+    parser = argparse.ArgumentParser(description=description)
+    parser.add_argument("src")
+    parser.add_argument("number", type=int, help="number of times to load the data")
+    help_force_dtype = (
+        "optionally cast all columns to one of the dtype choices in pandas"
+    )
+    parser.add_argument(
+        "--dtype",
+        choices=dtype_choices,
+        required=False,
+        default=None,
+        help=help_force_dtype,
+    )
+    args = parser.parse_args()
+    main(args)
diff --git a/scripts/perftest/runAll.sh b/scripts/perftest/runAll.sh
index db315597bf..9b20606c1d 100755
--- a/scripts/perftest/runAll.sh
+++ b/scripts/perftest/runAll.sh
@@ -127,6 +127,9 @@ echo -e "\n\n" >> results/times.txt
 ./runAllDimensionReduction.sh ${CMD} ${TEMPFOLDER} ${MAXMEM}
 ./runAllALS.sh ${CMD} ${TEMPFOLDER} ${MAXMEM}
 
+### IO Benchmarks:
+./runAllIO.sh ${CMD} ${TEMPFOLDER} ${MAXMEM}
+
 # TODO The following benchmarks have yet to be written. The decision tree algorithms additionally need to be fixed.
 # add stepwise Linear 
 # add stepwise GLM
diff --git a/scripts/perftest/runAllDimensionReduction.sh b/scripts/perftest/runAllDimensionReduction.sh
index e154926689..03955fc160 100755
--- a/scripts/perftest/runAllDimensionReduction.sh
+++ b/scripts/perftest/runAllDimensionReduction.sh
@@ -25,9 +25,9 @@ then
   exit 1;
 fi
 
-COMMAND=$1
-BASE=$2/dimensionreduction
-MAXMEM=$3
+CMD=${1:-systemds}
+BASE=${2:-"temp"}/dimensionreduction
+MAXMEM=${3:-80}
 
 FILENAME=$0
 err_report() {
diff --git a/scripts/perftest/runAllIO.sh b/scripts/perftest/runAllIO.sh
new file mode 100755
index 0000000000..8e321a7d4e
--- /dev/null
+++ b/scripts/perftest/runAllIO.sh
@@ -0,0 +1,82 @@
+#!/bin/bash
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+if [ "$(basename $PWD)" != "perftest" ];
+then
+  echo "Please execute scripts from directory 'perftest'"
+  exit 1;
+fi
+
+CMD=${1:-"systemds"}
+DATADIR=${2:-"temp"}/io
+MAXMEM=${3:-1}
+REPEATS=${4:-1}
+
+DATA=()
+if [ $MAXMEM -ge 1 ]; then DATA+=("500_250_dense"); fi
+if [ $MAXMEM -ge 10 ]; then DATA+=("5k_250_dense"); fi
+if [ $MAXMEM -ge 80 ]; then DATA+=("10k_1k_dense"); fi
+if [ $MAXMEM -ge 800 ]; then DATA+=("100k_1k_dense"); fi
+if [ $MAXMEM -ge 8000 ]; then DATA+=("1M_1k_dense"); fi
+if [ $MAXMEM -ge 80000 ]; then DATA+=("10M_1k_dense"); fi
+if [ $MAXMEM -ge 800000 ]; then DATA+=("100M_1k_dense"); fi
+
+echo "RUN IO Benchmarks: " $(date) >> results/times.txt;
+
+execute_python_script () {
+  script=$1
+  input=$2
+  repeats=$3
+  DTYPE=$4
+  printf "%-16s " "${script}; " >> results/times.txt;
+  if [ -z "$DTYPE" ]; then
+    TIME_IO=$(python ./python/io/${script} ${input} ${repeats});
+  else
+    TIME_IO=$(python ./python/io/${script} ${input} ${repeats} --dtype ${DTYPE});
+  fi
+  printf "%s\n" "$TIME_IO" >> results/times.txt
+}
+
+for d in ${DATA[@]}
+do
+  echo "-- Running IO benchmarks on "$d >> results/times.txt;
+  DATAFILE="$DATADIR/X$d"
+  F="runIO.sh" 
+  for vtype in "double" "int" "string" "boolean"
+  do
+    . ./$F $CMD $DATAFILE $REPEATS $vtype
+    cp "${DATAFILE}.mtd" "${DATAFILE}.mtd.backup" 
+    sed -i "s/\"value_type\":.*$/\"value_type\": \"${vtype}\",/" "${DATAFILE}.mtd"
+    printf "%-10s " "${vtype}: " >> results/times.txt;
+    execute_python_script "load_native.py" $DATAFILE $REPEATS
+    rm "${DATAFILE}.mtd"
+    mv "${DATAFILE}.mtd.backup" "${DATAFILE}.mtd"
+  done
+  for vtype in "double" "float" "long" "int64" "int32" "uint8" "string" "bool"
+  do
+    printf "%-10s " "${vtype}: " >> results/times.txt;
+    execute_python_script "load_numpy.py" $DATAFILE $REPEATS $vtype
+    printf "%-10s " "${vtype}: " >> results/times.txt;
+    execute_python_script "load_pandas.py" $DATAFILE $REPEATS $vtype
+  done
+done
+
+echo -e "\n\n" >> results/times.txt
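
The loop above temporarily rewrites `value_type` in the `.mtd` metadata file with `sed`, runs the benchmark, and then restores the backup. The same backup, override, and restore steps can be sketched in Python, assuming the metadata file is plain JSON (the context manager `overridden_value_type` is hypothetical and not part of the commit, and the demo file's keys are illustrative only):

```python
import json
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def overridden_value_type(mtd_path, value_type):
    """Temporarily set "value_type" in a JSON metadata file, then restore it."""
    backup = mtd_path + ".backup"
    shutil.copyfile(mtd_path, backup)
    try:
        with open(mtd_path) as f:
            meta = json.load(f)
        meta["value_type"] = value_type
        with open(mtd_path, "w") as f:
            json.dump(meta, f, indent=4)
        yield mtd_path
    finally:
        # Put the original file back even if the benchmark raises.
        os.replace(backup, mtd_path)


# Demo on a throwaway file; the keys below are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X.mtd")
    with open(path, "w") as f:
        json.dump({"data_type": "matrix", "value_type": "double"}, f)
    with overridden_value_type(path, "int"):
        with open(path) as f:
            during = json.load(f)["value_type"]
    with open(path) as f:
        after = json.load(f)["value_type"]
    print(during, after)
```

The `try`/`finally` is a design choice of this sketch: it restores the original metadata even when the timed step fails partway through.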
diff --git a/scripts/perftest/runAllMultinomial.sh b/scripts/perftest/runAllMultinomial.sh
index 1078c20581..2b878d24ae 100755
--- a/scripts/perftest/runAllMultinomial.sh
+++ b/scripts/perftest/runAllMultinomial.sh
@@ -31,7 +31,6 @@ MAXMEM=$3
 
 if [ "$TEMPFOLDER" == "" ]; then TEMPFOLDER=temp ; fi
 BASE=${TEMPFOLDER}/multinomial
-BASE0=${TEMPFOLDER}/binomial
 MAXITR=20
 
 FILENAME=$0
diff --git a/scripts/perftest/runPCA.sh b/scripts/perftest/runIO.sh
similarity index 60%
copy from scripts/perftest/runPCA.sh
copy to scripts/perftest/runIO.sh
index 66fd356005..15df74664a 100755
--- a/scripts/perftest/runPCA.sh
+++ b/scripts/perftest/runIO.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 #-------------------------------------------------------------
 #
 # Licensed to the Apache Software Foundation (ASF) under one
@@ -8,9 +8,9 @@
 # to you under the Apache License, Version 2.0 (the
 # "License"); you may not use this file except in compliance
 # with the License.  You may obtain a copy of the License at
-# 
+#
 #   http://www.apache.org/licenses/LICENSE-2.0
-# 
+#
 # Unless required by applicable law or agreed to in writing,
 # software distributed under the License is distributed on an
 # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
@@ -19,7 +19,6 @@
 # under the License.
 #
 #-------------------------------------------------------------
-set -e
 
 if [ "$(basename $PWD)" != "perftest" ];
 then
@@ -27,17 +26,29 @@ then
   exit 1;
 fi
 
-CMD=$3
-BASE=$2
 
-tstart=$(date +%s.%N)
+CMD=$1
+DATA=$2
+REPEAT=${3:-1}
+VTYPE=${4:-"double"}
+DTYPE=${5:-"matrix"}
 
-# ${CMD} -f ../algorithms/PCA.dml \
-${CMD} -f ./scripts/PCA.dml \
-  --config conf/SystemDS-config.xml \
-  --stats \
-  --nvargs INPUT=$1 SCALE=1 PROJDATA=1 OUTPUT=${BASE}/output
+cp "${DATA}.mtd" "${DATA}.mtd.backup"
+sed -i "s/\"data_type\":.*$/\"data_type\": \"${DTYPE}\",/" "${DATA}.mtd"
+sed -i "s/\"value_type\":.*$/\"value_type\": \"${VTYPE}\",/" "${DATA}.mtd"
+tstart=$(date +%s.%N)
+printf "%-10s " "$VTYPE: " >> results/times.txt;
+printf "%-16s " "read.dml; " >> results/times.txt;
+for n in $(seq $REPEAT)
+do
+  ${CMD} -f ./scripts/read.dml \
+    --config conf/SystemDS-config.xml \
+    --stats \
+    --nvargs INPUT="$DATA"
+done
 
-ttrain=$(echo "$(date +%s.%N) - $tstart - .4" | bc)
-echo "PCA on "$1": "$ttrain >> results/times.txt
+duration=$(echo "$(date +%s.%N) - $tstart" | bc)
+printf "%s\n" "$duration" >> results/times.txt
+rm "${DATA}.mtd"
+mv "${DATA}.mtd.backup" "${DATA}.mtd"
 
diff --git a/scripts/perftest/runPCA.sh b/scripts/perftest/runPCA.sh
index 66fd356005..fdb56d4a8f 100755
--- a/scripts/perftest/runPCA.sh
+++ b/scripts/perftest/runPCA.sh
@@ -27,7 +27,7 @@ then
   exit 1;
 fi
 
-CMD=$3
+CMD=${3:-systemds}
 BASE=$2
 
 tstart=$(date +%s.%N)
diff --git a/scripts/perftest/runPCA.sh b/scripts/perftest/scripts/read.dml
old mode 100755
new mode 100644
similarity index 66%
copy from scripts/perftest/runPCA.sh
copy to scripts/perftest/scripts/read.dml
index 66fd356005..e391926ac2
--- a/scripts/perftest/runPCA.sh
+++ b/scripts/perftest/scripts/read.dml
@@ -1,4 +1,3 @@
-#!/bin/bash
 #-------------------------------------------------------------
 #
 # Licensed to the Apache Software Foundation (ASF) under one
@@ -8,9 +7,9 @@
 # to you under the Apache License, Version 2.0 (the
 # "License"); you may not use this file except in compliance
 # with the License.  You may obtain a copy of the License at
-# 
+#
 #   http://www.apache.org/licenses/LICENSE-2.0
-# 
+#
 # Unless required by applicable law or agreed to in writing,
 # software distributed under the License is distributed on an
 # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
@@ -19,25 +18,5 @@
 # under the License.
 #
 #-------------------------------------------------------------
-set -e
-
-if [ "$(basename $PWD)" != "perftest" ];
-then
-  echo "Please execute scripts from directory 'perftest'"
-  exit 1;
-fi
-
-CMD=$3
-BASE=$2
-
-tstart=$(date +%s.%N)
-
-# ${CMD} -f ../algorithms/PCA.dml \
-${CMD} -f ./scripts/PCA.dml \
-  --config conf/SystemDS-config.xml \
-  --stats \
-  --nvargs INPUT=$1 SCALE=1 PROJDATA=1 OUTPUT=${BASE}/output
-
-ttrain=$(echo "$(date +%s.%N) - $tstart - .4" | bc)
-echo "PCA on "$1": "$ttrain >> results/times.txt
 
+data = read($INPUT);
diff --git a/scripts/utils/generateData.dml b/scripts/utils/generateData.dml
index fd13934901..11a6e700b8 100644
--- a/scripts/utils/generateData.dml
+++ b/scripts/utils/generateData.dml
@@ -45,6 +45,7 @@ minVal = ifdef($Min, 0)
 maxVal = ifdef($Max, 10)
 pdFunc = ifdef($Pdf, "uniform")
 pathUse = ifdef($Path, "/user/bigr/randomData")
+format = ifdef($Fmt, "csv")
 
 A = rand(rows=numRows, cols=numCols, sparsity=sparsityParam, min=minVal, max=maxVal, pdf="uniform");
-write(A, pathUse, format="csv");
+write(A, pathUse, format=format);
diff --git a/src/main/python/systemds/utils/converters.py b/src/main/python/systemds/utils/converters.py
index 3d0c8cf146..136e3470ca 100644
--- a/src/main/python/systemds/utils/converters.py
+++ b/src/main/python/systemds/utils/converters.py
@@ -135,6 +135,11 @@ def pandas_to_frame_block(sds, pd_df: pd.DataFrame):
 
 
 def frame_block_to_pandas(sds, fb: JavaObject):
+    """Converts a FrameBlock object in the JVM to a pandas dataframe.
+
+    :param sds: The current systemds context.
+    :param fb: A pointer to the JVM's FrameBlock object.
+    """
 
     num_rows = fb.getNumRows()
     num_cols = fb.getNumColumns()
