[ https://issues.apache.org/jira/browse/SYSTEMML-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nakul Jindal updated SYSTEMML-918:
----------------------------------
    Attachment: stages.jpg

screenshot from spark web UI.

> Severe performance drop when using csv matrix -> spark df -> systemml binary block
> ----------------------------------------------------------------------------------
>
>                 Key: SYSTEMML-918
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-918
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Runtime
>            Reporter: Nakul Jindal
>         Attachments: stages.jpg
>
>
> Imran is trying to read in a csv file which encodes a matrix.
> The dimensions of the matrix are 10000 x 784.
> Here is what the metadata looks like:
> {code}
> {
>     "data_type": "matrix",
>     "value_type": "double",
>     "rows": 10000,
>     "cols": 784,
>     "format": "csv",
>     "header": false,
>     "sep": ",",
>     "description": { "author": "SystemML" }
> }
> {code}
> There are two ways I read this matrix into the [tSNE|https://github.com/iyounus/incubator-systemml/blob/SYSTEMML-831_Implement_tSNE/scripts/staging/tSNE.dml#L25] algorithm as variable X.
> 1. The csv file is read into a dataframe and then the MLContext API is used to convert it to the binary block format. Here is the code used to create the dataframe:
> {code}
> val X0 = {sqlContext.read
>   .format("com.databricks.spark.csv")
>   .option("header", "false")     // The files have no header row
>   .option("inferSchema", "true") // Automatically infer data types
>   .load("hdfs://rr-dense1/user/iyounus/data/mnist_test.csv")
> }
> // code for tsne ..........
> val X = X0.drop(X0.col("C784")) // Drop last column
> val tsneScript = dml(tsne).in("X", X).out("Y", "C")
> println(Calendar.getInstance().get(Calendar.MINUTE))
> val res = ml.execute(tsneScript)
> {code}
> 2. A DML read statement reads the csv file directly.
> Option 2 happens almost instantaneously. Option 1 takes about 8 minutes on a beefy machine with 40g allocated to the driver and 60g allocated to the executors on a 7-node cluster.
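> The exact read statement for option 2 is not shown in this issue; a minimal sketch, assuming the metadata above is stored next to the data as mnist_test.csv.mtd, would be:
> {code}
> # Hypothetical sketch -- the actual tSNE driver script is not included here.
> # With the .mtd file present, dimensions and format are picked up automatically:
> X = read("hdfs://rr-dense1/user/iyounus/data/mnist_test.csv")
> # Equivalently, passing the metadata inline instead of via the .mtd file:
> # X = read("hdfs://rr-dense1/user/iyounus/data/mnist_test.csv",
> #          rows=10000, cols=784, format="csv", header=FALSE, sep=",")
> {code}
> In this path SystemML parses the csv into binary blocks itself, which is what makes the dataframe conversion in option 1 the suspect step.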
> Looking at the spark web UI and following the code path, it leads us to this function:
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L874-L879
> This needs some investigation.
> [~iyounus], [~fschueler]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)