[ https://issues.apache.org/jira/browse/SYSTEMML-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Dusenberry updated SYSTEMML-952:
-------------------------------------
    Description: 
Currently, we spend a lot of time on {{count}} during the conversions from wide 
DataFrames. When calling {{count}} in Spark on these DataFrames directly, it is 
much quicker to first select one of the simple double columns (say the id 
column) and then call {{count}}, because Spark then does not have to 
deserialize the heavy vector column as well.

Therefore, we should perform the row count only on the index column, and the 
column count on the first row.
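
A minimal sketch of the proposed counting strategy, written against the PySpark DataFrame API rather than the actual SystemML converter code (which is not shown in this issue). The helper names {{row_count}} and {{col_count}} and the column names {{id}} and {{features}} are assumptions made for illustration only.

```python
def row_count(df, index_col="id"):
    """Count rows after projecting down to one cheap scalar column.

    `index_col` is a hypothetical name for the simple double column
    mentioned above; the heavy vector column is left out of the
    projection that gets counted.
    """
    return df.select(index_col).count()


def col_count(df, vector_col="features"):
    """Infer the column count from the vector stored in the first row only.

    `vector_col` is a hypothetical name for the heavy vector column.
    """
    first = df.select(vector_col).first()
    if first is None:  # empty DataFrame: no row to inspect
        return 0
    return len(first[0])


def demo():
    """Usage example; needs a local PySpark installation to run."""
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(
        [(0.0, Vectors.dense([1.0, 2.0, 3.0])),
         (1.0, Vectors.dense([4.0, 5.0, 6.0]))],
        ["id", "features"],
    )
    print(row_count(df), col_count(df))
    spark.stop()
```

The helpers only rely on {{select}}, {{count}}, and {{first}}, so only the row count requires a full pass over the data, and that pass touches just the index column. How much time this saves in practice depends on the data source and on how Spark plans the query, so it should be measured on the wide DataFrames in question.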

cc [~mboehm7]

  was:
Currently, we spend a lot of time on {{count}} during the conversions from wide 
DataFrames. When calling {{count}} in Spark on these DataFrames directly, it is 
much quicker to just select one of the simple double columns (say the id 
column) and then {{count}}, in that it does not read in the heavy vector 
column as well.

Therefore, we should perform the row count only on the index column, and the 
column count on the first row.

cc [~mboehm7]


> Efficient Counts During Conversions
> -----------------------------------
>
>                 Key: SYSTEMML-952
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-952
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>
> Currently, we spend a lot of time on {{count}} during the conversions from 
> wide DataFrames. When calling {{count}} in Spark on these DataFrames 
> directly, it is much quicker to first select one of the simple double columns 
> (say the id column) and then call {{count}}, because Spark then does not have 
> to deserialize the heavy vector column as well.
> Therefore, we should perform the row count only on the index column, and the 
> column count on the first row.
> cc [~mboehm7]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
