rubenssoto opened a new issue #1893:
URL: https://github.com/apache/hudi/issues/1893
Hi, how are you?
I'm trying to create a Hudi dataset from a small source dataset, about 2 GB. I ran an insert operation, and because the dataset is so small I don't want any partitions; I want 2 files of roughly 1 GB each, but Hudi is creating a lot of 30 MB files.
**This is my spark-submit command**

```
spark-submit --deploy-mode cluster \
  --conf spark.dynamicAllocation.minExecutors=4 \
  --conf spark.executor.cores=3 \
  --conf spark.executor.memoryOverhead=2048 \
  --conf spark.executor.memory=20g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
  --py-files some_python_modules --files some_files main.py parameter
```
**These are my Hudi options**

```python
hudi_options = {
    'hoodie.table.name': self.table_name,
    'hoodie.datasource.write.recordkey.field': hudi_config.primary_key_column,
    'hoodie.datasource.write.table.name': self.table_name,
    'hoodie.datasource.write.operation': hudi_config.write_operation,
    'hoodie.combine.before.insert': 'true' if hudi_config.write_operation == 'insert' else 'false',
    'hoodie.combine.before.upsert': 'true' if hudi_config.write_operation == 'upsert' else 'false',
    'hoodie.datasource.write.precombine.field': hudi_config.precombined_column,
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    'hoodie.parquet.small.file.limit': 700000000,
    'hoodie.parquet.max.file.size': 900000000,
    'hoodie.parquet.block.size': 700000000,
    'hoodie.cleaner.commits.retained': 1,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': self.table_name,
    'hoodie.datasource.hive_sync.database': 'datalake_raw',
    'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://ip-10-0-62-197.us-west-2.compute.internal:10000',
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.upsert.shuffle.parallelism': 2
}
```
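For clarity, the two `hoodie.combine.before.*` flags are meant to be `'true'` only when the write operation matches (Hudi expects string `'true'`/`'false'` values for these options). A minimal standalone sketch of that logic, using a hypothetical `HudiConfig` class in place of my real config object:

```python
from dataclasses import dataclass


@dataclass
class HudiConfig:
    # Hypothetical stand-in for my real config object; only the
    # field relevant to the combine flags is modeled here.
    write_operation: str


def combine_flags(cfg: HudiConfig) -> dict:
    # Python conditional expressions: the value before `if` is chosen
    # when the condition holds, otherwise the value after `else`.
    return {
        'hoodie.combine.before.insert':
            'true' if cfg.write_operation == 'insert' else 'false',
        'hoodie.combine.before.upsert':
            'true' if cfg.write_operation == 'upsert' else 'false',
    }
```

So for an insert job, `combine_flags(HudiConfig('insert'))` yields `{'hoodie.combine.before.insert': 'true', 'hoodie.combine.before.upsert': 'false'}`.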
I tried removing these options:

```python
'hoodie.parquet.small.file.limit': 700000000,
'hoodie.parquet.max.file.size': 900000000,
'hoodie.parquet.block.size': 700000000
```

and Hudi created many files of 4.7 MB each:
<img width="1680" alt="Captura de Tela 2020-07-31 às 12 18 57"
src="https://user-images.githubusercontent.com/36298331/89049747-156f7500-d328-11ea-94f6-c54f1607844e.png">
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]