[jira] [Commented] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

Mike Dusenberry (JIRA) Wed, 14 Sep 2016 21:40:53 -0700

    [ https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492347#comment-15492347 ]


Mike Dusenberry commented on SYSTEMML-909:
------------------------------------------

Okay, I slimmed this down to a more basic example: I take only 100 rows out 
of the original DataFrame, do the bit of processing, write the result to a 
file, and then read that file back in as a new DataFrame.  At this point it is 
only a 225 MB DataFrame (100 rows out of the original 4.6 million), yet the 
{{javaRDD}} call still takes **3.1 minutes** to process.  This occurs even if 
I cache the DataFrame and run the same script multiple times.

DAG:
{{mapPartitionsInternal}} -> {{InMemoryColumnarTableScan}} -> {{ConvertToSafe}} 
-> {{Exchange}}

Any thoughts on this?


> `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.
> ------------------------------------------------------------
>
>                 Key: SYSTEMML-909
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-909
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>
> The {{[determineDataFrameDimensionsIfNeeded(...) | https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/mlcontext/MLContextConversionUtil.java#L585]}}
> function in {{MLContext}} is a major bottleneck, particularly due to the 
> {{javaRDD}} call.
> The issue I'm seeing is that the javaRDD.count() call forces execution of 
> the lazy DataFrames I pass in, which are created from another DataFrame via 
> df.randomSplit([0.8, 0.2]), so a shuffle occurs. I know that this is going 
> to happen anyway in the internal conversion, but it wastes a lot of time to 
> also do it in this step. Assume that I have more data than I can efficiently 
> cache (~7TB with the potential for much more), so I need to incur the 
> shuffle step only once on the way into the engine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
