rangareddy commented on issue #11532:
URL: https://github.com/apache/hudi/issues/11532#issuecomment-2568049040
Hi @subash-metica,
I can reproduce this issue with the code below, and I will open an upstream Hudi ticket to implement the fix.
```sh
pyspark \
  --jars /opt/hudi/packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0-rc1.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
```
```python
df = spark.createDataFrame(
    [(1, 2, 4, 4), (1, 2, 4, 5), (1, 2, 3, 6), (1, 2, 3, 7)] * 1000000,
    ["a", "b", "c", "d"],
)

hudi_options = {
    "hoodie.table.name": "test_clustering",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "a,b,c,d",
    "hoodie.datasource.write.partitionpath.field": "a,b,c",
    "hoodie.datasource.write.table.name": "test_clustering",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.enable": "false",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.clean.automatic": "true",
    "hoodie.metadata.enable": "true",
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "1",
    "hoodie.cleaner.commits.retained": "2",
    "hoodie.clustering.plan.strategy.partition.regex.pattern": ".*c=(4|3).*",
    "hoodie.datasource.write.operation": "insert_overwrite",
}

df.write.mode("append").format("hudi").options(**hudi_options).save("/tmp/hudi/test_clustering")
spark.sql("call show_clustering(path => '/tmp/hudi/test_clustering')").show()
```
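To see which instants `show_clustering` will iterate over, it can help to list the table's timeline files directly. A minimal sketch (the helper name is mine, and it assumes the pre-1.0 on-disk layout where completed instants live directly under `<table>/.hoodie`; the timeline location and file extensions vary across Hudi versions):

```python
import os

def list_timeline_files(table_path, suffix=".replacecommit"):
    """List completed instant files with the given suffix from a Hudi
    timeline directory (assumption: instants sit directly under .hoodie)."""
    hoodie_dir = os.path.join(table_path, ".hoodie")
    return sorted(f for f in os.listdir(hoodie_dir) if f.endswith(suffix))

# After the write above has finished, something like:
# list_timeline_files("/tmp/hudi/test_clustering")
```

With inline clustering and an `insert_overwrite` write in the same run, I would expect to see more than one `replacecommit` instant here.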
```
py4j.protocol.Py4JJavaError: An error occurred while calling o45.sql.
: java.util.NoSuchElementException: No value present in Option
    at org.apache.hudi.common.util.Option.get(Option.java:93)
    at org.apache.spark.sql.hudi.command.procedures.ShowClusteringProcedure.$anonfun$call$5(ShowClusteringProcedure.scala:79)
    at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
    at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
    at scala.collection.immutable.Stream.length(Stream.scala:312)
    at scala.collection.SeqLike.size(SeqLike.scala:108)
    at scala.collection.SeqLike.size$(SeqLike.scala:108)
    at scala.collection.AbstractSeq.size(Seq.scala:45)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:341)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
    at scala.collection.AbstractTraversable.toArray(Traversable.scala:108)
```
With `limit => 1`, the call succeeds; however, increasing the limit beyond 1 makes it fail again.
```python
>>> spark.sql("call show_clustering(path => '/tmp/hudi/test_clustering', limit => 1)").show()
+-----------------+----------------+---------+-------------------+
|        timestamp|input_group_size|    state|involved_partitions|
+-----------------+----------------+---------+-------------------+
|20250102161544309|               2|COMPLETED|                  *|
+-----------------+----------------+---------+-------------------+
```