[GitHub] [hudi] HEPBO3AH opened a new issue, #8234: [ISSUE] Clustering always rewrites the file even when there is nothing to cluster it with

via GitHub Sun, 19 Mar 2023 13:32:38 -0700


HEPBO3AH opened a new issue, #8234:
URL: https://github.com/apache/hudi/issues/8234


   **Describe the problem you faced**
   
   When clustering triggers, it is supposed to go through the selected 
partitions. If there are files that can be clustered, it creates new versions 
and marks the old ones to be cleaned. 
   However, if the only clustering candidate is a single file, that file is 
still selected and rewritten even though its contents stay 100% the same.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Write into a table to create a clustering candidate
   2. Run clustering
   3. Observe a new version of the file being created even though there is 
nothing to cluster it with
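
   To make step 3 easier to check, here is a minimal sketch that parses base-file names from a listing of the table directory and orders them by instant time. It assumes the usual base-file naming `<fileId>_<writeToken>_<instantTime>.parquet`; `BaseFiles`, `BaseFile`, `parse` and `sortedByInstant` are hypothetical names made up for illustration and are not part of Hudi. As far as I understand, clustering writes its output under a new file ID, so a rewrite of a single-record table would show up as a second base file with a later instant time.

   ```scala
   // Hypothetical helper, not part of Hudi.
   // Assumes base files are named <fileId>_<writeToken>_<instantTime>.parquet.
   object BaseFiles {
     final case class BaseFile(fileId: String, instantTime: String)

     // Returns None for anything that does not look like a base file
     // (for example metadata files in the same directory).
     def parse(name: String): Option[BaseFile] =
       if (!name.endsWith(".parquet")) None
       else {
         val parts = name.stripSuffix(".parquet").split("_")
         if (parts.length == 3) Some(BaseFile(parts(0), parts(2))) else None
       }

     // Base files ordered by the instant that wrote them, oldest first.
     def sortedByInstant(names: Seq[String]): Seq[BaseFile] =
       names.flatMap(parse).sortBy(_.instantTime)
   }
   ```

   Feeding it the file names from the table directory before and after the clustering run shows whether an extra base file appeared.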
   
   **Expected behavior**
   
   If there is nothing to cluster, a new version of the file shouldn't be created.
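
   The expected behavior can be sketched as a filter over candidate groups. This is only an illustration under my own simplified model, not Hudi's actual plan strategy: `ClusteringGuard`, `FileSlice` and `worthRewriting` are hypothetical names, and a "group" here is just the list of files that would be rewritten together.

   ```scala
   // Hypothetical model of the expected behavior, not Hudi's implementation.
   object ClusteringGuard {
     final case class FileSlice(fileId: String, sizeBytes: Long)

     // Keep only groups in which there is something to cluster:
     // a group with a single file would just be rewritten unchanged.
     def worthRewriting(groups: Seq[Seq[FileSlice]]): Seq[Seq[FileSlice]] =
       groups.filter(_.size > 1)
   }
   ```

   One caveat with this sketch: it ignores the case where clustering is also used to re-sort data, in which a single-file rewrite could still change the file.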
   
   **Environment Description**
   
   * Hudi version : 0.11 / 0.12.2 / 0.13
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : ?
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   
   Code sample:
   
   ```
     // Imports were omitted from the original sample; these are assumed.
     import org.apache.commons.lang3.RandomStringUtils
     import org.apache.hudi.DataSourceWriteOptions.{KEYGENERATOR_CLASS_NAME, PRECOMBINE_FIELD, TABLE_TYPE}
     import org.apache.hudi.keygen.constant.KeyGeneratorOptions.RECORDKEY_FIELD_NAME
     import org.apache.spark.sql.{SaveMode, SparkSession}

     // logPath and basePath are defined elsewhere (not shown in the original).
     def main(args: Array[String]): Unit = {
       println("Start.")
   
       val numberOfCores = "3"
       val spark = SparkSession
         .builder()
         .master("local[" + numberOfCores + "]")
         .appName("hudi-clustering-duplication")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", logPath)
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate()
   
       import spark.implicits._
       // A single record with a random key, so each run adds one small file.
       val df = Seq(RandomStringUtils.randomAlphabetic(5)).toDF("id")
   
       df
         .write
         .format("hudi")
         .option(RECORDKEY_FIELD_NAME.key(), "id")
         .option(PRECOMBINE_FIELD.key(), "id")
         .option(TABLE_TYPE.key(), "COPY_ON_WRITE")
         .option(KEYGENERATOR_CLASS_NAME.key(), "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
         .option("hoodie.table.name", "tbl")
         .option("hoodie.copyonwrite.record.size.estimate", "20")
         .option("hoodie.parquet.small.file.limit", "73400320") // 70MB
         .option("hoodie.clustering.inline", "true")
         .option("hoodie.clustering.inline.max.commits", "0")
         .mode(SaveMode.Append)
         .save(basePath + "/tbl")
     }
   ```
   
   


