nsivabalan edited a comment on issue #1625:
URL: https://github.com/apache/incubator-hudi/issues/1625#issuecomment-629790932
@rolandjohann : I couldn't reproduce the ever-growing Hudi table; maybe I am
missing something. Can you try my code below and let us know what you see?
@bvaradar : Can you think of any reason why roland is seeing the ever-growing
Hudi table?
My initial insert (100k records) took 14 MB in Hudi.
A single batch of updates (2k records) takes 165 KB on disk if I write it to
parquet directly.
Here are my disk sizes after applying the same batch of updates repeatedly:
| Round No | Total disk size (`du -s -h basePath`) |
|----------|---------------------------------------|
| 1 | 23 MB |
| 2 | 24 MB |
| 3 | 34 MB |
| 4 | 35 MB |
| 5 | 46 MB |
| 6 | 46 MB |
| 7 | 43 MB |
| 8 | 44 MB |
| 9 | 43 MB |
| 10 | 44 MB |
| 11 | 45 MB |
| 12 | 46 MB |
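For reference, the per-round numbers above can be collected with a small shell helper around the same `du -s -h` invocation. This is a hypothetical sketch, not part of the original repro; the function name is made up.

```shell
# table_size_h <path> — print the human-readable total size of a table's
# base path, exactly as `du -s -h <path>` reports it (first column only).
# Hypothetical helper for recording the size after each upsert round.
table_size_h() {
  du -s -h "$1" | cut -f1
}

# Example (assumes the base path used in the repro below):
#   table_size_h /tmp/hudi_trips_mor
```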
Code to reproduce:
```
// spark-shell
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
```
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_mor"
val basePath = "file:///tmp/hudi_trips_mor"
val basePathParquet = "file:///tmp/parquet"
val dataGen = new DataGenerator

// Initial insert: 100k records into a MERGE_ON_READ table.
val inserts = convertToStringList(dataGen.generateInserts(100000))
val dfInsert = spark.read.json(spark.sparkContext.parallelize(inserts, 10))
dfInsert.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ").
  option(TABLE_NAME, tableName).
  mode(Append).save(basePath)

// Generate a 2k-record update batch; also write it to plain parquet
// for the disk-size comparison quoted above.
val updates = convertToStringList(dataGen.generateUpdates(2000))
val dfUpdates = spark.read.json(spark.sparkContext.parallelize(updates, 2))
dfUpdates.coalesce(1).write.format("parquet").mode(Append).save(basePathParquet)

// Upsert the same batch into the Hudi table with cleaning and inline
// compaction enabled. Re-run this block once per round to repeat the
// same batch of updates.
dfUpdates.coalesce(1).write.format("org.apache.hudi").
  option("hoodie.insert.shuffle.parallelism", "2").
  option("hoodie.upsert.shuffle.parallelism", "2").
  option("hoodie.cleaner.commits.retained", "3").
  option("hoodie.cleaner.fileversions.retained", "2").
  option("hoodie.compact.inline", "true").
  option("hoodie.compact.inline.max.delta.commits", "2").
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(TABLE_NAME, tableName).
  mode(Append).save(basePath)
```
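To see whether the cleaner and inline compactor are actually reclaiming space between rounds, it may also help to look at the file layout directly rather than only the total size. A hedged sketch (the paths assume the base path from the repro above; the helper names are made up, though the layout they rely on, with the commit timeline under `.hoodie` and parquet base files plus `.log` delta files under the partition directories, is standard for a Hudi MOR table):

```shell
# count_base_files <path> — count parquet base files under a table path.
# count_log_files <path>  — count MOR delta log files under a table path.
# Hypothetical inspection helpers; a growing base-file or log-file count
# across rounds would point at old file slices not being cleaned.
count_base_files() {
  find "$1" -name '*.parquet' | wc -l
}
count_log_files() {
  find "$1" -name '*.log.*' | wc -l
}

# Example:
#   ls /tmp/hudi_trips_mor/.hoodie        # commit/compaction timeline
#   count_base_files /tmp/hudi_trips_mor
#   count_log_files /tmp/hudi_trips_mor
```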
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]