nsivabalan commented on issue #1625:
URL: https://github.com/apache/incubator-hudi/issues/1625#issuecomment-629790932


   @bvaradar : I tried to reproduce this locally and couldn't. Is there any chance of data skewness on your end?
   @rolandjohann : I couldn't reproduce the ever-growing Hudi table either. Maybe I am missing something. Can you try the code below and let us know what you see?
   
   My initial insert (100k records) took 14 MB in Hudi.
   A single batch of updates (2k records), written directly to parquet, takes 165 KB on disk.
   
   Here are the total disk sizes after applying the same batch of updates repeatedly:
   
   | Round No | Total disk size (`du -s -h basePath`) |
   |----------|---------------------------------------|
   | 1        | 23 MB |
   | 2        | 24 MB |
   | 3        | 34 MB |
   | 4        | 35 MB |
   | 5        | 46 MB |
   | 6        | 46 MB |
   | 7        | 43 MB |
   | 8        | 44 MB |
   | 9        | 43 MB |
   | 10       | 44 MB |
   | 11       | 45 MB |
   | 12       | 46 MB |
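   One plausible reading of the plateau (an assumption on my part, not something I verified against the Hudi timeline): inline compaction keeps producing new base-file versions of roughly constant size, and the cleaner (`hoodie.cleaner.fileversions.retained` in the config below) trims the older versions, so total disk usage stabilizes at roughly (retained versions × version size) instead of growing forever. A toy Scala sketch of that bookkeeping:

   ```scala
   // Toy model (assumption, not Hudi's actual accounting): each round adds a
   // new file version of a fixed size, and the cleaner keeps only the newest
   // `retained` versions, so total size plateaus once retention kicks in.
   object CleanerSketch {
     def diskAfterRounds(rounds: Int, retained: Int, versionSizeMb: Int): Int =
       math.min(rounds, retained) * versionSizeMb
   }

   // Example: with 4 retained versions of ~11 MB each (both numbers are
   // illustrative), growth stops after round 4 -- qualitatively matching
   // the plateau in the table above:
   // (1 to 12).map(r => CleanerSketch.diskAfterRounds(r, 4, 11))
   ```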
   
   
   Code to reproduce: 
   
   ```
   # spark-shell
   spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   ```
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   
   val tableName = "hudi_trips_mor"
   val basePath = "file:///tmp/hudi_trips_mor"
   val basePathParquet = "file:///tmp/parquet"
   val dataGen = new DataGenerator
   
   val inserts = convertToStringList(dataGen.generateInserts(100000))
   val dfInsert = spark.read.json(spark.sparkContext.parallelize(inserts, 10))
   
   dfInsert.write.format("hudi").
     options(getQuickstartWriteConfigs).
     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     option(STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ").
     option(TABLE_NAME, tableName).
     mode(Append).
     save(basePath)
   
   val updates = convertToStringList(dataGen.generateUpdates(2000))
   val dfUpdates = spark.read.json(spark.sparkContext.parallelize(updates, 2))
   
   
   dfUpdates.coalesce(1).write.format("parquet").mode(Append).save(basePathParquet)
   
   
   // Run this upsert once per round; the disk sizes in the table above were
   // measured after each run.
   dfUpdates.coalesce(1).write.format("org.apache.hudi").
     option("hoodie.insert.shuffle.parallelism", "2").
     option("hoodie.upsert.shuffle.parallelism", "2").
     option("hoodie.cleaner.commits.retained", "3").
     option("hoodie.cleaner.fileversions.retained", "2").
     option("hoodie.compact.inline", "true").
     option("hoodie.compact.inline.max.delta.commits", "2").
     option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
     option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).
     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     option(TABLE_NAME, tableName).
     mode(Append).
     save(basePath)
   ```
   
   
   
   

