[ https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492347#comment-15492347 ]
Mike Dusenberry commented on SYSTEMML-909:
------------------------------------------

Okay, I slimmed this example down to a more basic one in which I take only 100 rows out of the original DataFrame, do the bit of processing, write to a file, and then read that file in as a new DataFrame. This is only a 225 MB DataFrame at this point (100-row subset), and the {{javaRDD}} call takes **3.1 minutes** to process. This occurs even if I cache the DataFrame and run the same script multiple times. This is only on a fraction of the data (100 rows vs. 4.6 million rows).

DAG: {{mapPartitionsInternal}} -> {{InMemoryColumnarTableScan}} -> {{ConvertToSafe}} -> {{Exchange}}

Any thoughts on this?

> `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.
> ------------------------------------------------------------
>
>                 Key: SYSTEMML-909
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-909
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>
> The {{[determineDataFrameDimensionsIfNeeded(...) | https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/mlcontext/MLContextConversionUtil.java#L585]}} function in {{MLContext}} is a major bottleneck, particularly due to the `javaRDD` call.
> The issue I'm seeing is that the javaRDD.count() function causes execution of the lazy DataFrames I pass in, which are created from another DataFrame via df.randomSplit([0.8, 0.2]), thus a shuffle occurs. I know that this is going to happen anyway in the internal conversion, but it wastes a lot of time to also do it in this step. Assume that I have more data than I can efficiently cache (~7TB with the potential for much more), so I need to incur the shuffle step only once on the way into the engine.
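
To make the double-evaluation problem in the issue description concrete, here is a toy model in plain Python. It is not Spark or SystemML code: `LazyFrame`, `convert`, and the `dims` parameter are hypothetical names invented for this sketch, and the assumption that the count can be skipped when the caller already knows the dimensions is inferred only from the "IfNeeded" in the function name.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not a Spark API)."""

    def __init__(self, rows):
        self._rows = rows
        self.evaluations = 0  # how many times the expensive pipeline has run

    def _evaluate(self):
        # Stands in for the expensive work, e.g. the shuffle behind a random split.
        self.evaluations += 1
        return [list(r) for r in self._rows]

    def count(self):
        # An action: forces a full evaluation just to learn the row count.
        return len(self._evaluate())

    def collect(self):
        # Another action: the pass that the conversion needs in any case.
        return self._evaluate()


def convert(frame, dims=None):
    """Convert to a matrix, computing dimensions only if the caller did not supply them."""
    if dims is None:
        n_rows = frame.count()    # extra evaluation, only to get the row count
        data = frame.collect()    # second evaluation, for the actual conversion
        n_cols = len(data[0]) if data else 0
    else:
        n_rows, n_cols = dims     # assumed to be known by the caller
        data = frame.collect()    # single evaluation
    return data, (n_rows, n_cols)


rows = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]

unknown = LazyFrame(rows)
_, dims_unknown = convert(unknown)

known = LazyFrame(rows)
_, dims_known = convert(known, dims=(2, 3))

# Number of full evaluations with unknown vs. known dimensions.
print(unknown.evaluations, known.evaluations)
```

In this toy model the frame with unknown dimensions is evaluated twice and the one with supplied dimensions only once, which is the saving the issue asks for when the data is too large to cache.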