[ https://issues.apache.org/jira/browse/SYSTEMML-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nakul Jindal updated SYSTEMML-918:
----------------------------------
    Attachment: stages.jpg

Screenshot from the Spark web UI.

> Severe performance drop when using csv matrix -> spark df -> systemml binary block
> -----------------------------------------------------------------------------------
>
>                 Key: SYSTEMML-918
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-918
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Runtime
>            Reporter: Nakul Jindal
>         Attachments: stages.jpg
>
>
> Imran is trying to read in a csv file that encodes a matrix.
> The dimensions of the matrix are 10000 x 784.
> Here is what the metadata looks like:
> {
>     "data_type": "matrix",
>     "value_type": "double",
>     "rows": 10000,
>     "cols": 784,
>     "format": "csv",
>     "header": false,
>     "sep": ",",
>     "description": { "author": "SystemML" }
> }
> There are two ways I read this matrix into the [tSNE|https://github.com/iyounus/incubator-systemml/blob/SYSTEMML-831_Implement_tSNE/scripts/staging/tSNE.dml#L25] algorithm as variable X.
> 1. The csv file is read into a DataFrame, and the MLContext API is then 
> used to convert it to the binary block format. Here is the code used to 
> create the DataFrame:
> {code}
> // Read the csv into a DataFrame via the spark-csv package
> val X0 = sqlContext.read
>     .format("com.databricks.spark.csv")
>     .option("header", "false")     // The file has no header row
>     .option("inferSchema", "true") // Automatically infer data types
>     .load("hdfs://rr-dense1/user/iyounus/data/mnist_test.csv")
> // code for tsne ..........
> val X = X0.drop(X0.col("C784")) // Drop the last column
> val tsneScript = dml(tsne).in("X", X).out("Y", "C")
> println(Calendar.getInstance().get(Calendar.MINUTE))
> val res = ml.execute(tsneScript)
> {code}
> 2. A DML read statement reads the csv file directly.
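> For concreteness, a minimal sketch of what the option-2 read looks like 
> (the DML is embedded here as a Scala string to keep one language; the read 
> parameters are restated from the .mtd metadata shown above, so they are 
> assumptions, not new information):
> {code}
> // Hypothetical sketch: in option 2 the read happens inside the DML script,
> // so SystemML performs the csv -> binary block conversion itself.
> val readScript = """
> X = read("hdfs://rr-dense1/user/iyounus/data/mnist_test.csv",
>          format="csv", header=FALSE, sep=",")
> """
> {code}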
> Option 2 completes almost instantaneously. Option 1 takes about 8 minutes 
> on a beefy 7-node cluster with 40g allocated to the driver and 60g 
> allocated to the executors.
> Following the code path from the Spark web UI leads us to this function:
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java#L874-L879
> This needs some investigation.
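> For what it's worth, a minimal timing sketch that separates the csv I/O 
> from the DataFrame -> binary block conversion (it reuses the names from 
> the snippet above and assumes nothing about SystemML internals):
> {code}
> // Force the csv read and schema inference first, then time execute()
> // separately, so the ~8 minutes can be attributed to the conversion
> // rather than to reading the file.
> X.cache()
> X.count() // materializes the DataFrame
> val t0 = System.nanoTime()
> val resTimed = ml.execute(tsneScript)
> println(s"ml.execute took ${(System.nanoTime() - t0) / 1e9} s")
> {code}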
> [~iyounus], [~fschueler]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
