rangareddy commented on issue #11532:
URL: https://github.com/apache/hudi/issues/11532#issuecomment-2568049040
Hi @subash-metica,
I can reproduce this issue with the code below, and I will open an upstream Hudi ticket to implement the fix.
```sh
pyspark \
  --jars /opt/hudi/packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0-rc1.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
```
```python
df = spark.createDataFrame(
    [(1, 2, 4, 4), (1, 2, 4, 5), (1, 2, 3, 6), (1, 2, 3, 7)] * 1000000,
    ["a", "b", "c", "d"],
)

hudi_options = {
    "hoodie.table.name": "test_clustering",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "a,b,c,d",
    "hoodie.datasource.write.partitionpath.field": "a,b,c",
    "hoodie.datasource.write.table.name": "test_clustering",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.enable": "false",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.clean.automatic": "true",
    "hoodie.metadata.enable": "true",
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "1",
    "hoodie.cleaner.commits.retained": "2",
    "hoodie.clustering.plan.strategy.partition.regex.pattern": ".*c=(4|3).*",
    "hoodie.datasource.write.operation": "insert_overwrite",
}

df.write.mode("append").format("hudi").options(**hudi_options).save("/tmp/hudi/test_clustering")
spark.sql("call show_clustering(path => '/tmp/hudi/test_clustering')").show()
```
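To see which instants `show_clustering` will iterate over, it can help to list the table's timeline files directly. A minimal sketch (the helper name is mine, and it assumes the pre-1.0 on-disk layout where completed instants live directly under `<table>/.hoodie`; the timeline location and file extensions vary across Hudi versions):

```python
import os

def list_timeline_files(table_path, suffix=".replacecommit"):
    """List completed instant files with the given suffix from a Hudi
    timeline directory (assumption: instants sit directly under .hoodie)."""
    hoodie_dir = os.path.join(table_path, ".hoodie")
    return sorted(f for f in os.listdir(hoodie_dir) if f.endswith(suffix))

# After the write above has finished, something like:
# list_timeline_files("/tmp/hudi/test_clustering")
```

With inline clustering and an `insert_overwrite` write in the same run, I would expect to see more than one `replacecommit` instant here.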
```
py4j.protocol.Py4JJavaError: An error occurred while calling o45.sql.
: java.util.NoSuchElementException: No value present in Option
    at org.apache.hudi.common.util.Option.get(Option.java:93)
    at org.apache.spark.sql.hudi.command.procedures.ShowClusteringProcedure.$anonfun$call$5(ShowClusteringProcedure.scala:79)
    at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
    at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
    at scala.collection.immutable.Stream.length(Stream.scala:312)
    at scala.collection.SeqLike.size(SeqLike.scala:108)
    at scala.collection.SeqLike.size$(SeqLike.scala:108)
    at scala.collection.AbstractSeq.size(Seq.scala:45)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:341)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
    at scala.collection.AbstractTraversable.toArray(Traversable.scala:108)
```
With `limit => 1`, the call succeeds; however, increasing the limit beyond 1 makes it fail again.
```python
>>> spark.sql("call show_clustering(path => '/tmp/hudi/test_clustering', limit => 1)").show()
+-----------------+----------------+---------+-------------------+
|        timestamp|input_group_size|    state|involved_partitions|
+-----------------+----------------+---------+-------------------+
|20250102161544309|               2|COMPLETED|                  *|
+-----------------+----------------+---------+-------------------+
```