Hi Maciek,
I've tested several variants for summing "fieldToSum":
First, RDD-style code:
df.as[A].map(_.fieldToSum).reduce(_ + _)
df.as[A].rdd.map(_.fieldToSum).sum()
df.as[A].map(_.fieldToSum).rdd.sum()
All three take around 30 seconds. "reduce" and "sum" seem to have the same
performance, for this use case at least.
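
(For reference, these timings assume a setup roughly like the following; the
case class just mirrors the column, and the wall-clock timer is an
illustrative sketch, not the exact harness I ran:)

import spark.implicits._

case class A(fieldToSum: Long)  // one Long field matching the column name

// naive wall-clock timing helper, good enough for coarse comparisons
def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

time { df.as[A].map(_.fieldToSum).reduce(_ + _) }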
Then with sql.functions.sum:
df.agg(sum("fieldToSum")).collect().head.getAs[Long](0)
0.24 seconds, super fast.
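
(Note that this one needs import org.apache.spark.sql.functions.sum in scope;
first() should be an equivalent way to pull out the single result row:)

import org.apache.spark.sql.functions.sum

df.agg(sum("fieldToSum")).first().getAs[Long](0)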
Finally, dataset with column selection:
df.as[A].select($"fieldToSum".as[Long]).reduce(_ + _)
0.18 seconds, super fast again.
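
(Here the $"fieldToSum" syntax and the .as[Long] encoder both come from
import spark.implicits._, which the snippet assumes is in scope.)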
(I've also tried replacing my sums and reduces with counts, as you advised,
but the performance is unchanged. Apparently, summing takes little time
compared to accessing the data.)
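
For the record, such a count-based probe would look something like this,
presumably forcing the same decoding work without the arithmetic:

df.as[A].map(_.fieldToSum).count()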
It seems that we need to use the SQL interface to reach the highest level of
performance, which somewhat breaks the promise of Dataset (preserving type
safety while getting the Catalyst and Tungsten performance of DataFrames).
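
That said, if I'm reading the 2.0 API correctly, the typed aggregators in
org.apache.spark.sql.expressions.scalalang.typed are meant to keep type
safety while compiling down to Catalyst expressions; something like this
might be worth adding to the benchmark:

import org.apache.spark.sql.expressions.scalalang.typed

df.as[A].select(typed.sumLong[A](_.fieldToSum)).first()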
As for direct access to Row, it seems to have gotten much slower from 1.6 to
2.0. I guess it's because DataFrame is now Dataset[Row], and thus goes through
the same encoding/decoding mechanism as any other case class.
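
One way to probe that hypothesis would be to time Row access with and without
dropping to the RDD first (with the caveat that df.rdd itself may still pay a
conversion cost in 2.0):

df.map(_.getAs[Long]("fieldToSum")).reduce(_ + _)  // goes through the Row encoder
df.rdd.map(_.getAs[Long]("fieldToSum")).sum()      // plain RDD[Row] access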
Best regards,
Julien
> On 27 Aug 2016, at 22:32, Maciej Bryński <[email protected]> wrote:
>
>
> 2016-08-27 15:27 GMT+02:00 Julien Dumazert <[email protected]>:
> df.map(row => row.getAs[Long]("fieldToSum")).reduce(_ + _)
>
> I think reduce and sum have very different performance.
> Did you try sql.functions.sum ?
> Or if you want to benchmark access to the Row object, then the count()
> function would be a better idea.
>
> Regards,
> --
> Maciek Bryński