[
https://issues.apache.org/jira/browse/SPARK-47425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shreyas updated SPARK-47425:
----------------------------
Description:
In Scala, it is possible to do the following to create a bucketed table without producing many small files:
{code:scala}
df.repartition(expr("pmod(hash(user_id), 200)"))
.write
.mode(SaveMode.Overwrite)
.bucketBy(200, "user_id")
.option("path", output_path)
.saveAsTable("bucketed_table")
{code}
I found [this small
trick|https://towardsdatascience.com/best-practices-for-bucketing-in-spark-sql-ea9f23f7dd53]
to produce the same number of files as buckets.
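For context, the trick works because repartitioning on {{pmod(hash(user_id), 200)}} sends all rows of a given bucket to the same shuffle partition, so each write task emits exactly one bucket file. A toy Python sketch of that mapping (the hash here is a stand-in, not Spark's Murmur3; any deterministic hash gives the same structure):

```python
NUM_BUCKETS = 200

def toy_hash(user_id: int) -> int:
    # Stand-in for Spark's hash(); deterministic but NOT Murmur3.
    return (user_id * 2654435761) & 0x7FFFFFFF

def bucket(user_id: int) -> int:
    # pmod(hash(user_id), 200): the bucket a row belongs to.
    return toy_hash(user_id) % NUM_BUCKETS

# Group a sample of rows by the repartition key.
partitions = {}
for uid in range(100_000):
    partitions.setdefault(bucket(uid), []).append(uid)

# Every row in a partition shares one bucket id, so each write task
# emits a single bucket file: at most NUM_BUCKETS files in total.
assert len(partitions) <= NUM_BUCKETS
assert all(len({bucket(u) for u in uids}) == 1
           for uids in partitions.values())
```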
However, the equivalent does not work in spark-sql (using the REPARTITION hint):
{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+ repartition (pmod(hash(user_id),200)) */
* from df_table
{code}
This fails with:
{{REPARTITION Hint parameter should include columns, but 'pmod('hash('user_id),
200) found.}}
When I instead create a derived column and repartition on that, Spark no longer
respects the repartition:
{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+repartition (bkt) */
*, pmod(hash(user_id),200) as bkt
from df_table
{code}
{code:bash}
$ hdfs dfs -ls -h /user/spark/warehouse/bucket_test.db/bucketed_table| head
Found 101601 items
...{code}
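Rough arithmetic for the explosion above: when the hint is ignored, every upstream write task can emit one file per bucket it touches, so the file count scales as tasks × buckets rather than just buckets. A hedged sketch (the task count below is illustrative, not measured from this job):

```python
NUM_BUCKETS = 200

def worst_case_files(num_tasks: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Upper bound: each write task emits a separate file per bucket.
    return num_tasks * num_buckets

# With the repartition respected, one task per bucket: 200 files total.
assert worst_case_files(1) == NUM_BUCKETS
# Without it, e.g. ~500 upstream tasks can produce ~100k files,
# in the ballpark of the 101601 items observed above.
assert worst_case_files(500) == 100_000
```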
Can the behavior of the REPARTITION hint be changed to match the Scala/Python
equivalent?
Thank you
> spark-sql does not recognize expressions in repartition hint
> ------------------------------------------------------------
>
> Key: SPARK-47425
> URL: https://issues.apache.org/jira/browse/SPARK-47425
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.1
> Reporter: Shreyas
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)