FelixKJose opened a new issue #3676:
URL: https://github.com/apache/hudi/issues/3676


   I have been trying to set up data in a Hudi MOR table for my consumption load 
test, but came across an interesting behavior. I configured the write 
operation as 'upsert', even though all of my writes are inserts. When I 
perform a new insert, I can see a new version of the base parquet file being 
created with the new data appended.
   Since my commits-retained configuration is set to 1, only one old 
version of the base file is kept. Everything was working as expected until Hudi 
started creating new parquet files once the current parquet file reached ~10 MB, even 
though my max parquet file size is set to 128 MB and the small file size limit is left 
at its default (100 MB).
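   (For reference, and as a restatement rather than a fix: the two file-sizing knobs in play here can be spelled out explicitly. The dict below is a sketch using the documented Hudi keys and the sizes described above.)

```python
# Sketch: the file-sizing settings described above, stated explicitly.
# Values restate the issue text: 128 MB max base-file size, 100 MB
# small-file limit (the documented default).
file_sizing_options = {
    "hoodie.parquet.max.file.size": str(128 * 1024 * 1024),     # 134217728
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # 104857600
}
```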
   
   Following is my configuration:
           "hoodie.datasource.write.table.type": "MERGE_ON_READ",
           "hoodie.datasource.write.precombine.field": "eventDateTime",
           "hoodie.datasource.write.hive_style_partitioning": "true",
           "hoodie.datasource.write.streaming.retry.count": 3,
           "hoodie.datasource.write.streaming.retry.interval.ms": 2000,
           "hoodie.datasource.write.streaming.ignore.failed.batch": "false",
           "hoodie.payload.ordering.field": "eventDateTime",
           "hoodie.datasource.write.payload.class": 
"org.apache.hudi.common.model.DefaultHoodieRecordPayload",
           "hoodie.upsert.shuffle.parallelism": 1,
           "hoodie.insert.shuffle.parallelism": 1,
           "hoodie.consistency.check.enabled": "false",
           "hoodie.index.type": "BLOOM",
           "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
           "hoodie.index.bloom.num_entries": 60000,
           "hoodie.index.bloom.fpp": 1e-09,
           "hoodie.parquet.max.file.size": "134217728",
           "hoodie.parquet.block.size": "134217728",
           "hoodie.parquet.page.size": "1048576",
           # "hoodie.datasource.compaction.async.enable": True,
           "hoodie.compact.inline": True,
           # "hoodie.clean.async": True,
           # 'hoodie.clean.automatic': True,
           'hoodie.cleaner.commits.retained': 1,
           "hoodie.keep.min.commits": 2,
           "hoodie.compact.inline.max.delta.commits": 2,
           "hoodie.table.name": "flattened_calculations_mor",
           "hoodie.datasource.write.recordkey.field": "identifier",
           "hoodie.datasource.hive_sync.table": "flattened_calculations_mor",
           "hoodie.upsert.shuffle.parallelism": 5,
           "hoodie.insert.shuffle.parallelism": 5,
           "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.NonpartitionedKeyGenerator",
           "hoodie.datasource.hive_sync.partition_extractor_class": 
"org.apache.hudi.hive.NonPartitionedExtractor"
       
           spark_session = (
               SparkSession.builder.appName("Data_Bulk_ingestion_Job")
               .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
               .config("spark.sql.hive.convertMetastoreParquet", "false")
               .config(
                   "spark.jars.packages",
                   "org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.2",
               )
               .getOrCreate()
           )
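   For completeness, a minimal sketch of how an options dict like the one above is typically passed to a Hudi write in PySpark; `df` and `base_path` are placeholders for the batch DataFrame and table location, not part of the original report:

```python
# Sketch of the write call; `df` and `base_path` are placeholders.
# Only a few representative keys from the config above are repeated here.
hudi_options = {
    "hoodie.table.name": "flattened_calculations_mor",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "identifier",
    "hoodie.datasource.write.precombine.field": "eventDateTime",
}

# (df.write.format("hudi")
#    .options(**hudi_options)
#    .mode("append")
#    .save(base_path))
```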
       Following is a screenshot of the files generated:
   <img width="1041" alt="Screen Shot 2021-09-16 at 11 25 35 AM" src="https://user-images.githubusercontent.com/22526075/133640249-5e0a51d2-6e28-4b05-a547-64f3a183decd.png">
       Following is the .hoodie folder content:
   [hoodie.zip](https://github.com/apache/hudi/files/7179089/hoodie.zip)
   
   
   @nsivabalan
   
   

