Shreyas created SPARK-47425:
-------------------------------
Summary: spark-sql does not recognize expressions in repartition hint
Key: SPARK-47425
URL: https://issues.apache.org/jira/browse/SPARK-47425
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.1
Reporter: Shreyas
In Scala, it is possible to repartition by an expression when creating a
bucketed table, so the table does not end up with many small files:
{code:scala}
df.repartition(expr("pmod(hash(user_id), 200)"))
.write
.mode(SaveMode.Overwrite)
.bucketBy(200, "user_id")
.option("path", output_path)
.saveAsTable("bucketed_table")
{code}
I found [this small
trick|https://towardsdatascience.com/best-practices-for-bucketing-in-spark-sql-ea9f23f7dd53]
for producing the same number of files as buckets.
However, the equivalent does not work in spark-sql (using the REPARTITION hint):
{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+repartition (pmod(hash(user_id),200)) */
* from df_table
{code}
This fails with:
{{REPARTITION Hint parameter should include columns, but 'pmod('hash('user_id),
200) found.}}
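As a possible workaround (untested here, and assuming the same {{df_table}} source), the {{DISTRIBUTE BY}} clause of a Spark SQL SELECT does accept arbitrary expressions, so the same repartitioning can be requested without going through the hint:

{code:sql}
-- DISTRIBUTE BY accepts expressions, unlike the REPARTITION hint
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select * from df_table
distribute by pmod(hash(user_id), 200)
{code}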
When I instead create a virtual column and reference that in the hint, Spark no
longer respects the repartition:
{code:sql}
create table bucketed_table stored as parquet
clustered by (user_id) into 200 buckets
select /*+repartition (bkt) */
*, pmod(hash(user_id),200) as bkt
from df_table
{code}
{code:bash}
$ hdfs dfs -ls -h /user/spark/warehouse/bucket_test.db/bucketed_table| head
Found 101601 items
...{code}
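Note that the rejected expression is valid Spark SQL on its own (assuming {{df_table}} has a {{user_id}} column); only the hint parser refuses it:

{code:sql}
-- runs fine as a regular query; only the REPARTITION hint rejects the expression
select pmod(hash(user_id), 200) as bkt, count(*) as cnt
from df_table
group by pmod(hash(user_id), 200)
{code}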
Can the behavior of the REPARTITION hint be changed to accept expressions, like
the Scala/Python API does?
Thank you
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]