Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?
Dataset/DataFrame has repartition (which can be used to partition by key) and sortWithinPartitions.

See example usage here:
https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala#L18

On Fri, Jun 23, 2017 at 5:43 PM, Keith Chapman wrote:
> Hi,
>
> I have code that does the following using RDDs,
>
> val outputPartitionCount = 300
> val part = new MyOwnPartitioner(outputPartitionCount)
> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>
> where myRdd is correctly formed as key, value pairs. I am looking to
> convert this to use Dataset/Dataframe instead of RDDs, so my question is:
>
> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
> Can I accomplish the partial sort using mapPartitions on the resulting
> partitioned Dataset/Dataframe?
>
> Any thoughts?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
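The combination above can be sketched as follows. This is a minimal illustration, not code from the thread; the DataFrame `df` and the column name "key" are made-up names, and 300 matches the partition count from the original question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("repartition-sort-sketch").getOrCreate()
import spark.implicits._

// Hypothetical key/value DataFrame standing in for the original RDD.
val df = Seq((3, "c"), (1, "a"), (2, "b")).toDF("key", "value")

// Hash-partition into 300 partitions by "key", then sort each partition
// locally. No global sort is performed, which mirrors the semantics of
// repartitionAndSortWithinPartitions on an RDD.
val partitioned = df
  .repartition(300, col("key"))
  .sortWithinPartitions(col("key"))
```

Note that `repartition` uses Spark's internal hash partitioning on the given expressions; unlike the RDD API, there is no hook here for supplying a fully custom Partitioner.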
Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?
Hi Nguyen,

This looks promising; it seems like I could achieve this using "cluster by". Thanks for the pointer.

Regards,
Keith.

http://keith-chapman.com

On Sat, Jun 24, 2017 at 5:27 AM, nguyen duc Tuan wrote:
> Hi Chapman,
> You can use "cluster by" to do what you want.
> https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/
>
> 2017-06-24 17:48 GMT+07:00 Saliya Ekanayake:
>
>> I haven't worked with datasets but would this help
>> https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd?
>>
>> On Jun 23, 2017 5:43 PM, "Keith Chapman" wrote:
>>
>>> Hi,
>>>
>>> I have code that does the following using RDDs,
>>>
>>> val outputPartitionCount = 300
>>> val part = new MyOwnPartitioner(outputPartitionCount)
>>> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>>>
>>> where myRdd is correctly formed as key, value pairs. I am looking to
>>> convert this to use Dataset/Dataframe instead of RDDs, so my question is:
>>>
>>> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
>>> Can I accomplish the partial sort using mapPartitions on the resulting
>>> partitioned Dataset/Dataframe?
>>>
>>> Any thoughts?
>>>
>>> Regards,
>>> Keith.
>>>
>>> http://keith-chapman.com
Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?
Thanks for the pointer Saliya. I'm looking for an equivalent API in Dataset/Dataframe for repartitionAndSortWithinPartitions; I've already converted most of the RDDs to Dataframes.

Regards,
Keith.

http://keith-chapman.com

On Sat, Jun 24, 2017 at 3:48 AM, Saliya Ekanayake wrote:
> I haven't worked with datasets but would this help
> https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd?
>
> On Jun 23, 2017 5:43 PM, "Keith Chapman" wrote:
>
>> Hi,
>>
>> I have code that does the following using RDDs,
>>
>> val outputPartitionCount = 300
>> val part = new MyOwnPartitioner(outputPartitionCount)
>> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>>
>> where myRdd is correctly formed as key, value pairs. I am looking to
>> convert this to use Dataset/Dataframe instead of RDDs, so my question is:
>>
>> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
>> Can I accomplish the partial sort using mapPartitions on the resulting
>> partitioned Dataset/Dataframe?
>>
>> Any thoughts?
>>
>> Regards,
>> Keith.
>>
>> http://keith-chapman.com
Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?
Hi Chapman,
You can use "cluster by" to do what you want.
https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/

2017-06-24 17:48 GMT+07:00 Saliya Ekanayake:
> I haven't worked with datasets but would this help
> https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd?
>
> On Jun 23, 2017 5:43 PM, "Keith Chapman" wrote:
>
>> Hi,
>>
>> I have code that does the following using RDDs,
>>
>> val outputPartitionCount = 300
>> val part = new MyOwnPartitioner(outputPartitionCount)
>> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>>
>> where myRdd is correctly formed as key, value pairs. I am looking to
>> convert this to use Dataset/Dataframe instead of RDDs, so my question is:
>>
>> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
>> Can I accomplish the partial sort using mapPartitions on the resulting
>> partitioned Dataset/Dataframe?
>>
>> Any thoughts?
>>
>> Regards,
>> Keith.
>>
>> http://keith-chapman.com
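For reference, CLUSTER BY is shorthand for DISTRIBUTE BY followed by SORT BY on the same expressions: rows are hash-partitioned by the key and each partition is then sorted locally. A minimal sketch, assuming an existing SparkSession `spark` and a registered table `events` with a `key` column (both names are made up for illustration):

```scala
// CLUSTER BY key == DISTRIBUTE BY key + SORT BY key.
// Rows with the same key land in the same partition, and each
// partition is sorted by key; there is no global ordering.
val clustered = spark.sql(
  "SELECT key, value FROM events CLUSTER BY key"
)
```

The same plan can be produced on the DataFrame side with `repartition` followed by `sortWithinPartitions`.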
Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?
I haven't worked with datasets but would this help
https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd ?

On Jun 23, 2017 5:43 PM, "Keith Chapman" wrote:
> Hi,
>
> I have code that does the following using RDDs,
>
> val outputPartitionCount = 300
> val part = new MyOwnPartitioner(outputPartitionCount)
> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>
> where myRdd is correctly formed as key, value pairs. I am looking to
> convert this to use Dataset/Dataframe instead of RDDs, so my question is:
>
> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
> Can I accomplish the partial sort using mapPartitions on the resulting
> partitioned Dataset/Dataframe?
>
> Any thoughts?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
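The `MyOwnPartitioner` class referenced in the question is never shown in the thread; the following is only a hypothetical sketch of the shape such a custom RDD partitioner would have, using plain hash partitioning as a stand-in for whatever logic the real class contains:

```scala
import org.apache.spark.Partitioner

// Hypothetical sketch of MyOwnPartitioner (the actual class is not in
// the thread). A Partitioner maps each key to an index in
// [0, numPartitions). The modulo result is shifted to be non-negative
// because hashCode may be negative.
class MyOwnPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val mod = key.hashCode % partitions
    if (mod < 0) mod + partitions else mod
  }
}
```

With a key-value `myRdd`, `myRdd.repartitionAndSortWithinPartitions(new MyOwnPartitioner(300))` then shuffles records to the partition chosen by `getPartition` and sorts each partition by key, without a global sort.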