This should be fine. Dataset.groupByKey is a logical operation, not a physical one (i.e., Spark won't always materialize all the groups in memory).
On Thu, Feb 28, 2019 at 1:46 AM Etienne Chauchot <echauc...@apache.org> wrote:

> Hi all,
>
> I'm migrating RDD pipelines to Dataset and I saw that Combine.PerKey is no
> longer there in the Dataset API. So, I translated it to:
>
> KeyValueGroupedDataset<K, KV<K, InputT>> groupedDataset =
>     keyedDataset.groupByKey(KVHelpers.extractKey(), EncoderHelpers.genericEncoder());
>
> Dataset<Tuple2<K, OutputT>> combinedDataset =
>     groupedDataset.agg(
>         new Aggregator<KV<K, InputT>, AccumT, OutputT>(combineFn).toColumn());
>
> I have a question regarding performance: as GroupByKey is generally less
> performant (it entails a shuffle, and a possible OOM if a given key has a lot
> of data associated with it), I was wondering whether the new Spark optimizer
> translates such a DAG into a combine-per-key behind the scenes. In other
> words, is such a DAG going to be translated into a local (or partial, I don't
> know which terminology you use) combine followed by a global combine, to
> avoid shuffling all the raw data?
>
> Thanks
>
> Etienne
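For readers unfamiliar with the "local then global" pattern the question refers to, here is a minimal conceptual sketch in plain Python (not actual Spark or Beam code; all names are illustrative). It models a CombineFn-style mean with the usual four operations, runs a partial (map-side) combine per partition, and then merges only the small per-key accumulators, which is the behavior the question hopes the optimizer produces instead of a raw GroupByKey shuffle:

```python
# Conceptual sketch of "partial then global" combine. Only small accumulators
# cross the simulated shuffle boundary, not every raw record.

def create_accumulator():
    return (0.0, 0)  # (running sum, count)

def add_input(acc, value):
    s, n = acc
    return (s + value, n + 1)

def merge_accumulators(accs):
    return (sum(a[0] for a in accs), sum(a[1] for a in accs))

def extract_output(acc):
    s, n = acc
    return s / n if n else 0.0

def partial_combine(partition):
    """Map-side (local) combine: one small accumulator per key per
    partition, instead of shuffling every raw element."""
    accs = {}
    for key, value in partition:
        accs[key] = add_input(accs.get(key, create_accumulator()), value)
    return accs

def global_combine(partial_results):
    """Reduce-side (global) combine: merge the per-partition accumulators
    for each key, then extract the final output."""
    merged = {}
    for accs in partial_results:
        for key, acc in accs.items():
            merged.setdefault(key, []).append(acc)
    return {key: extract_output(merge_accumulators(accs))
            for key, accs in merged.items()}

# Two input partitions; only three accumulators cross the "shuffle",
# rather than five raw records.
partitions = [
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
    [("a", 5.0), ("c", 7.0)],
]
result = global_combine([partial_combine(p) for p in partitions])
# result == {"a": 3.0, "b": 10.0, "c": 7.0}
```

Whether Spark actually performs this rewrite depends on its Catalyst planner recognizing the aggregation as partial-mergeable, which is exactly what the question is asking.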