nsivabalan commented on issue #1625:
URL: https://github.com/apache/incubator-hudi/issues/1625#issuecomment-629790932
@bvaradar : I tried to reproduce this locally and couldn't. Is there any chance of data skew?
@rolandjohann : I couldn't reproduce the ever-growing Hudi table either. Maybe I am missing something. Can you try the code below and let us know what you see?
My initial insert (100k records) took 14 MB in Hudi.
A single batch of updates (2k records), written directly as parquet, takes 165 KB on disk.
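To compare the raw parquet footprint against the Hudi table, a minimal sketch (the two paths are assumptions taken from the repro script below):

```shell
#!/bin/sh
# Compare on-disk size of the plain parquet output vs. the Hudi table.
# Paths are assumptions matching the repro script below.
for p in /tmp/parquet /tmp/hudi_trips_mor; do
  if [ -d "$p" ]; then
    du -s -h "$p"
  fi
done
```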
Here are my disk sizes after applying the same update batch repeatedly:

| Round No | Total disk size (`du -s -h basePath`) |
|----------|---------------------------------------|
| 1        | 23 MB |
| 2        | 24 MB |
| 3        | 34 MB |
| 4        | 35 MB |
| 5        | 46 MB |
| 6        | 46 MB |
| 7        | 43 MB |
| 8        | 44 MB |
| 9        | 43 MB |
| 10       | 44 MB |
| 11       | 45 MB |
| 12       | 46 MB |
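The sizes above were captured with `du` after each round. A sketch of the measurement loop (the base path is an assumption from the repro script below; the spark-shell upsert step has to be re-run manually between measurements):

```shell
#!/bin/sh
# Record the Hudi table's total disk size after each update round.
# BASE is an assumption matching the repro script's base path.
BASE=/tmp/hudi_trips_mor
round=1
while [ "$round" -le 12 ]; do
  # ... re-run the upsert block in spark-shell here, then measure ...
  size=$(du -s -h "$BASE" | awk '{print $1}')
  echo "Round $round: $size"
  round=$((round + 1))
done
```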
Code to reproduce:
```
# launch spark-shell
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
```
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_mor"
val basePath = "file:///tmp/hudi_trips_mor"
val basePathParquet = "file:///tmp/parquet"
val dataGen = new DataGenerator

// Initial insert: 100k records into a MERGE_ON_READ table
val inserts = convertToStringList(dataGen.generateInserts(100000))
val dfInsert = spark.read.json(spark.sparkContext.parallelize(inserts, 10))
dfInsert.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(STORAGE_TYPE_OPT_KEY, "MERGE_ON_READ").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)

// Generate one batch of 2k updates
val updates = convertToStringList(dataGen.generateUpdates(2000))
val dfUpdates = spark.read.json(spark.sparkContext.parallelize(updates, 2))

// Baseline: write the same batch as plain parquet for a size comparison
dfUpdates.coalesce(1).write.format("parquet").mode(Append).save(basePathParquet)

// Upsert the batch into Hudi; re-run this block once per round
dfUpdates.coalesce(1).write.format("org.apache.hudi").
  option("hoodie.insert.shuffle.parallelism", "2").
  option("hoodie.upsert.shuffle.parallelism", "2").
  option("hoodie.cleaner.commits.retained", "3").
  option("hoodie.cleaner.fileversions.retained", "2").
  option("hoodie.compact.inline", "true").
  option("hoodie.compact.inline.max.delta.commits", "2").
  option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
  option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
```
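To see why the size plateaus rather than growing forever, one can inspect the timeline under `.hoodie` and check that clean and compaction actions are being scheduled; with the cleaner and inline-compaction settings above, such instants should appear after a few rounds. A sketch (the path is an assumption from the script; the exact instant-file naming depends on the Hudi version):

```shell
#!/bin/sh
# List timeline actions recorded in the Hudi metadata directory.
# Path is an assumption matching the repro script's base path.
ls -a /tmp/hudi_trips_mor/.hoodie 2>/dev/null | grep -E 'commit|clean|compaction' | sort
```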
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]