rubenssoto opened a new issue #1893:
URL: https://github.com/apache/hudi/issues/1893


   Hi, how are you?
   
   I'm trying to create a Hudi dataset from a small (~2 GB) dataset. I'm doing an insert operation, and because the dataset is so small I don't want any partitioning; I'd like roughly two files of ~1 GB each, but Hudi is creating a lot of ~30 MB files.
   
   **This is my spark-submit command**
   
   ```shell
   spark-submit --deploy-mode cluster \
     --conf spark.dynamicAllocation.minExecutors=4 \
     --conf spark.executor.cores=3 \
     --conf spark.executor.memoryOverhead=2048 \
     --conf spark.executor.memory=20g \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.hive.convertMetastoreParquet=false \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
     --py-files some_python_modules --files some_files main.py parameter
   ```
   
   **These are my Hudi options**
   
   ```python
   hudi_options = {
       'hoodie.table.name': self.table_name,
       'hoodie.datasource.write.recordkey.field': hudi_config.primary_key_column,
       'hoodie.datasource.write.table.name': self.table_name,
       'hoodie.datasource.write.operation': hudi_config.write_operation,
       'hoodie.combine.before.insert': 'true' if hudi_config.write_operation == 'insert' else 'false',
       'hoodie.combine.before.upsert': 'true' if hudi_config.write_operation == 'upsert' else 'false',
       'hoodie.datasource.write.precombine.field': hudi_config.precombined_column,
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
       'hoodie.parquet.small.file.limit': 700000000,
       'hoodie.parquet.max.file.size': 900000000,
       'hoodie.parquet.block.size': 700000000,
       'hoodie.cleaner.commits.retained': 1,
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.table': self.table_name,
       'hoodie.datasource.hive_sync.database': 'datalake_raw',
       'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://ip-10-0-62-197.us-west-2.compute.internal:10000',
       'hoodie.insert.shuffle.parallelism': 2,
       'hoodie.upsert.shuffle.parallelism': 2
   }
   ```
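   One subtlety worth flagging in the `combine.before.insert` / `combine.before.upsert` values: Python conditional expressions are order-sensitive. A form like `x == 'insert' if 'true' else 'false'` always returns the boolean result of the comparison, because the literal `'true'` is itself truthy; the intended form is `'true' if x == 'insert' else 'false'`, which yields the string Hudi expects. A minimal, self-contained demonstration (plain Python, no Hudi needed):
   
   ```python
   # Example value standing in for hudi_config.write_operation
   write_operation = 'insert'
   
   # Misordered form: the literal 'true' is always truthy, so this
   # evaluates to the boolean comparison, never the string 'false'
   misordered = write_operation == 'insert' if 'true' else 'false'
   
   # Intended form: yields the string 'true' or 'false'
   intended = 'true' if write_operation == 'insert' else 'false'
   
   print(type(misordered), misordered)  # <class 'bool'> True
   print(type(intended), intended)      # <class 'str'> true
   ```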
   
   I tried removing these options:
   
   ```python
   'hoodie.parquet.small.file.limit': 700000000,
   'hoodie.parquet.max.file.size': 900000000,
   'hoodie.parquet.block.size': 700000000
   ```
   
   and Hudi created many files of ~4.7 MB each.
   
   <img width="1680" alt="Screenshot 2020-07-31 at 12 18 57" src="https://user-images.githubusercontent.com/36298331/89049747-156f7500-d328-11ea-94f6-c54f1607844e.png">
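   For context, the file count I expected follows from simple arithmetic (a sketch; the ~2 GB / ~1 GB figures are the approximate sizes mentioned above, not measured Parquet output sizes):
   
   ```python
   # Rough expected-file-count arithmetic (assumes ~2 GB of data and
   # ~1 GB target files, in line with hoodie.parquet.max.file.size above)
   dataset_bytes = 2 * 1024**3       # ~2 GB input
   target_file_bytes = 1 * 1024**3   # ~1 GB per file
   expected_files = -(-dataset_bytes // target_file_bytes)  # ceiling division
   print(expected_files)  # 2
   ```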
   

