Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Koert Kuipers
Dataset/DataFrame has repartition (which can be used to partition by key)
and sortWithinPartitions.

see for example usage here:
https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala#L18
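
For example, a minimal sketch with the DataFrame API (assuming a DataFrame myDf with
a column named "key"; these names are placeholders, and note that repartition with
column expressions hash-partitions by those expressions rather than taking a custom
Partitioner):

import org.apache.spark.sql.functions.col

val outputPartitionCount = 300

// hash-partition by "key" into 300 partitions, then sort each partition by "key";
// the closest DataFrame analogue of repartitionAndSortWithinPartitions
val finalDf = myDf
  .repartition(outputPartitionCount, col("key"))
  .sortWithinPartitions(col("key"))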

On Fri, Jun 23, 2017 at 5:43 PM, Keith Chapman 
wrote:

> Hi,
>
> I have code that does the following using RDDs,
>
> val outputPartitionCount = 300
> val part = new MyOwnPartitioner(outputPartitionCount)
> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>
> where myRdd is correctly formed as key, value pairs. I am looking to convert
> this to use Dataset/Dataframe instead of RDDs, so my question is:
>
> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
> Can I accomplish the partial sort using mapPartitions on the resulting
> partitioned Dataset/Dataframe?
>
> Any thoughts?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>


Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Keith Chapman
Hi Nguyen,

This looks promising; it seems like I could achieve it using CLUSTER BY.
Thanks for the pointer.

Regards,
Keith.

http://keith-chapman.com

On Sat, Jun 24, 2017 at 5:27 AM, nguyen duc Tuan 
wrote:

> Hi Chapman,
> You can use "cluster by" to do what you want.
> https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/
>
> 2017-06-24 17:48 GMT+07:00 Saliya Ekanayake :
>
>> I haven't worked with datasets, but would this help?
>> https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
>>
>> On Jun 23, 2017 5:43 PM, "Keith Chapman"  wrote:
>>
>>> Hi,
>>>
>>> I have code that does the following using RDDs,
>>>
>>> val outputPartitionCount = 300
>>> val part = new MyOwnPartitioner(outputPartitionCount)
>>> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>>>
>>> where myRdd is correctly formed as key, value pairs. I am looking to
>>> convert this to use Dataset/Dataframe instead of RDDs, so my question is:
>>>
>>> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
>>> Can I accomplish the partial sort using mapPartitions on the resulting
>>> partitioned Dataset/Dataframe?
>>>
>>> Any thoughts?
>>>
>>> Regards,
>>> Keith.
>>>
>>> http://keith-chapman.com
>>>
>>
>


Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Keith Chapman
Thanks for the pointer, Saliya. I'm looking for an equivalent API in
Dataset/Dataframe for repartitionAndSortWithinPartitions; I've already
converted most of the RDDs to Dataframes.

Regards,
Keith.

http://keith-chapman.com

On Sat, Jun 24, 2017 at 3:48 AM, Saliya Ekanayake  wrote:

> I haven't worked with datasets, but would this help?
> https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
>
> On Jun 23, 2017 5:43 PM, "Keith Chapman"  wrote:
>
>> Hi,
>>
>> I have code that does the following using RDDs,
>>
>> val outputPartitionCount = 300
>> val part = new MyOwnPartitioner(outputPartitionCount)
>> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>>
>> where myRdd is correctly formed as key, value pairs. I am looking to convert
>> this to use Dataset/Dataframe instead of RDDs, so my question is:
>>
>> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
>> Can I accomplish the partial sort using mapPartitions on the resulting
>> partitioned Dataset/Dataframe?
>>
>> Any thoughts?
>>
>> Regards,
>> Keith.
>>
>> http://keith-chapman.com
>>
>


Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread nguyen duc Tuan
Hi Chapman,
You can use "cluster by" to do what you want.
https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/
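
A rough sketch of how that might look (the view name, column names, and partition
count are placeholders):

// set the number of shuffle partitions before running the query
spark.conf.set("spark.sql.shuffle.partitions", "300")

// register a placeholder DataFrame as a temp view so it can be queried with SQL
myDf.createOrReplaceTempView("my_table")

// CLUSTER BY key is shorthand for DISTRIBUTE BY key SORT BY key:
// rows are shuffled by "key" and each resulting partition is sorted by "key"
val clustered = spark.sql("SELECT key, value FROM my_table CLUSTER BY key")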

2017-06-24 17:48 GMT+07:00 Saliya Ekanayake :

> I haven't worked with datasets, but would this help?
> https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
>
> On Jun 23, 2017 5:43 PM, "Keith Chapman"  wrote:
>
>> Hi,
>>
>> I have code that does the following using RDDs,
>>
>> val outputPartitionCount = 300
>> val part = new MyOwnPartitioner(outputPartitionCount)
>> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>>
>> where myRdd is correctly formed as key, value pairs. I am looking to convert
>> this to use Dataset/Dataframe instead of RDDs, so my question is:
>>
>> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
>> Can I accomplish the partial sort using mapPartitions on the resulting
>> partitioned Dataset/Dataframe?
>>
>> Any thoughts?
>>
>> Regards,
>> Keith.
>>
>> http://keith-chapman.com
>>
>


Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Saliya Ekanayake
I haven't worked with datasets, but would this help?
https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
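
For what it's worth, a minimal sketch of that conversion (the case class, RDD, and
column names are placeholders, assuming a SparkSession named spark):

import spark.implicits._

// placeholder case class describing the key/value pairs
case class Record(key: String, value: Int)

// convert an RDD[(String, Int)] into a typed Dataset[Record]
val ds = myRdd.map { case (k, v) => Record(k, v) }.toDS()

// or build a DataFrame directly and name the columns explicitly
val df = spark.createDataFrame(myRdd).toDF("key", "value")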

On Jun 23, 2017 5:43 PM, "Keith Chapman"  wrote:

> Hi,
>
> I have code that does the following using RDDs,
>
> val outputPartitionCount = 300
> val part = new MyOwnPartitioner(outputPartitionCount)
> val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
>
> where myRdd is correctly formed as key, value pairs. I am looking to convert
> this to use Dataset/Dataframe instead of RDDs, so my question is:
>
> Is there custom partitioning of Dataset/Dataframe implemented in Spark?
> Can I accomplish the partial sort using mapPartitions on the resulting
> partitioned Dataset/Dataframe?
>
> Any thoughts?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>