You can partition and bucket a DataFrame by any column, and that column can be 
created with an expression: add a partition_id column to your DataFrame, then 
partition/bucket by that column.
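
For example, here is a minimal sketch of that approach, assuming an active 
SparkSession and some DataFrame df with a join key column "key" (all table and 
column names are made up):

  import org.apache.spark.sql.functions.{col, hash, lit, pmod}

  // Derive a partition_id column from an expression on the join key.
  val nbBuckets = 64
  val withId = df.withColumn("partition_id",
    pmod(hash(col("key")), lit(nbBuckets)))

  withId.write
    .partitionBy("partition_id")     // partition directories by the derived column
    .bucketBy(nbBuckets, "key")      // hash-bucket on the join key
    .sortBy("key")
    .saveAsTable("events_bucketed")  // bucketBy requires saveAsTable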

From: David Diebold <davidjdieb...@gmail.com>
Date: Monday, April 11, 2022 at 11:48 AM
To: "user @spark" <user@spark.apache.org>
Subject: Question about bucketing and custom partitioners


Hello,

I have a few questions related to bucketing and custom partitioning in the 
DataFrame API.

I am considering bucketing to get a one-side shuffle-free join in incremental 
jobs, but there is one thing I'm not happy with.
Data is likely to grow and skew over time. At some point, I would need to 
change the number of buckets, which would provoke a shuffle.
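
For concreteness, the join pattern I'm after looks roughly like this, assuming 
a table that was bucketed on the join key at write time (names are 
hypothetical):

  // One side is pre-bucketed, so only the incremental side gets shuffled.
  val history = spark.table("events_bucketed")        // bucketed on "key"
  val delta   = spark.read.parquet("/path/to/delta")  // new incremental data
  // With spark.sql.sources.bucketing.enabled (true by default), Spark can
  // drop the exchange on the bucketed side of this join.
  val joined = history.join(delta, Seq("key"))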

Instead of changing the bucket count, I would like to use a custom partitioner 
that would replace the shuffle with a narrow transformation.
That is something that was feasible with the RDD developer API. For example, I 
could use a partitioning scheme such as:

  partition_id = (nb_partitions - 1) * (hash(column) - Int.MinValue) / (Int.MaxValue - Int.MinValue)

When I multiply the number of partitions by 2, each new partition depends on 
only one parent partition (=> narrow transformation). A sketch of such a 
partitioner follows.
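
Here is roughly what that scheme looks like as an RDD Partitioner (the Long 
arithmetic is my addition, to avoid overflow in the intermediate product):

  import org.apache.spark.Partitioner

  // Linear hash-range partitioner: partition ids grow linearly with the
  // key's hash, so doubling the partition count splits each parent
  // partition into children that each read from that single parent (narrow).
  class LinearHashPartitioner(val nbPartitions: Int) extends Partitioner {
    override def numPartitions: Int = nbPartitions
    override def getPartition(key: Any): Int = {
      val h = key.hashCode.toLong
      (((nbPartitions - 1).toLong * (h - Int.MinValue)) /
        (Int.MaxValue.toLong - Int.MinValue)).toInt
    }
  }

  // Usage on a pair RDD; nothing equivalent is exposed on DataFrames:
  // rdd.partitionBy(new LinearHashPartitioner(128))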

So, here are my questions :

1/ Is it possible to use a custom partitioner when saving a dataframe with 
bucketing?
2/ Still with the DataFrame API, is it possible to apply a custom partitioner 
to a dataframe?
    Is it possible to repartition the dataframe with a narrow transformation, 
as could be done with RDDs?
    Is there some sort of DataFrame developer API? Do you have any pointers on 
this?

Thanks !
David
