nsivabalan edited a comment on issue #1625:
URL: https://github.com/apache/incubator-hudi/issues/1625#issuecomment-629790932
@rolandjohann : I couldn't reproduce the ever-growing Hudi table; maybe I am
missing something. Can you try my code below and let us know what you see?
@bvaradar : Can you think of any reason why roland is seeing the ever-growing
Hudi table?
My initial insert (100k records) took 14 MB in Hudi.
A single batch of updates (2k records) takes 165 KB on disk if I write it to
parquet directly.
Here are my disk sizes after applying the same batch of updates repeatedly:
| Round No | Total disk size (`du -s -h basePath`) |
|----------|---------------------------------------|
| 1 | 23 MB |
| 2 | 24 MB |
| 3 | 34 MB |
| 4 | 35 MB |
| 5 | 46 MB |
| 6 | 46 MB |
| 7 | 43 MB |
| 8 | 44 MB |
| 9 | 43 MB |
| 10 | 44 MB |
| 11 | 45 MB |
| 12 | 46 MB |
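For reference, the per-round numbers above can be collected with a small shell helper around the same `du -s -h` invocation. This is a hypothetical sketch, not part of the original repro; the function name is made up.

```shell
# table_size_h <path> — print the human-readable total size of a table's
# base path, exactly as `du -s -h <path>` reports it (first column only).
# Hypothetical helper for recording the size after each upsert round.
table_size_h() {
  du -s -h "$1" | cut -f1
}

# Example (assumes the base path used in the repro below):
#   table_size_h /tmp/hudi_trips_mor
```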
Code to reproduce:
```
// spark-shell
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
```
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_mor"
val basePath = "file:///tmp/hudi_trips_mor"
val basePathParquet = "file:///tmp/parquet"
val dataGen = new DataGenerator

// Initial insert: 100k records into a MERGE_ON_READ table.
val inserts = convertToStringList(dataGen.generateInserts(100000))
val dfInsert = spark.read.json(spark.sparkContext.parallelize(inserts, 10))
dfInsert.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ").
  option(TABLE_NAME, tableName).
  mode(Append).save(basePath)

// Generate a 2k-record update batch; also write it to plain parquet
// for the disk-size comparison quoted above.
val updates = convertToStringList(dataGen.generateUpdates(2000))
val dfUpdates = spark.read.json(spark.sparkContext.parallelize(updates, 2))
dfUpdates.coalesce(1).write.format("parquet").mode(Append).save(basePathParquet)

// Upsert the same batch into the Hudi table with cleaning and inline
// compaction enabled. Re-run this block once per round to repeat the
// same batch of updates.
dfUpdates.coalesce(1).write.format("org.apache.hudi").
  option("hoodie.insert.shuffle.parallelism", "2").
  option("hoodie.upsert.shuffle.parallelism", "2").
  option("hoodie.cleaner.commits.retained", "3").
  option("hoodie.cleaner.fileversions.retained", "2").
  option("hoodie.compact.inline", "true").
  option("hoodie.compact.inline.max.delta.commits", "2").
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(TABLE_NAME, tableName).
  mode(Append).save(basePath)
```
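To see whether the cleaner and inline compactor are actually reclaiming space between rounds, it may also help to look at the file layout directly rather than only the total size. A hedged sketch (the paths assume the base path from the repro above; the helper names are made up, though the layout they rely on, with the commit timeline under `.hoodie` and parquet base files plus `.log` delta files under the partition directories, is standard for a Hudi MOR table):

```shell
# count_base_files <path> — count parquet base files under a table path.
# count_log_files <path>  — count MOR delta log files under a table path.
# Hypothetical inspection helpers; a growing base-file or log-file count
# across rounds would point at old file slices not being cleaned.
count_base_files() {
  find "$1" -name '*.parquet' | wc -l
}
count_log_files() {
  find "$1" -name '*.log.*' | wc -l
}

# Example:
#   ls /tmp/hudi_trips_mor/.hoodie        # commit/compaction timeline
#   count_base_files /tmp/hudi_trips_mor
#   count_log_files /tmp/hudi_trips_mor
```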
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]