rubenssoto opened a new issue #1902: URL: https://github.com/apache/hudi/issues/1902
Hi guys,

I have a small dataset (7 GB). Because of that I didn't partition the data; I prefer to create a few big files, so Hudi created 13 files of 700 MB each. My dataset has an auto-increment id, which is the table's primary key. To load the old data I ran a batch operation, and Hudi put the data for a single day into 10 different files.

Does Hudi sort the data by the primary key column, or do I have to do an explicit sort operation myself?

For example, I ran this query:

    select _hoodie_file_name, count(1)
    from "order"
    where created_date_brt = '2020-07-01'
    group by _hoodie_file_name
    order by _hoodie_file_name

Result:

<img width="1258" alt="Captura de Tela 2020-08-02 às 19 27 53" src="https://user-images.githubusercontent.com/36298331/89133884-57e6ac80-d4f6-11ea-88de-a6a1f9fa80c0.png">

My Hudi config:

    hudi_options = {
        'hoodie.table.name': table_name,
        'hoodie.datasource.write.recordkey.field': hudi_config.primary_key_column,
        'hoodie.datasource.write.table.name': table_name,
        'hoodie.datasource.write.operation': hudi_config.write_operation,
        'hoodie.combine.before.insert': hudi_config.write_operation in ['insert','bulkinsert'] if 'true' else 'false',
        'hoodie.combine.before.upsert': hudi_config.write_operation == 'upsert' if 'true' else 'false',
        'hoodie.datasource.write.precombine.field': hudi_config.precombined_column,
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
        'hoodie.parquet.small.file.limit': 800000000,
        'hoodie.parquet.max.file.size': 900000000,
        'hoodie.parquet.block.size': 800000000,
        'hoodie.copyonwrite.record.size.estimate': 30,
        'hoodie.cleaner.commits.retained': 1,
        'hoodie.datasource.hive_sync.enable': 'true',
        'hoodie.datasource.hive_sync.table': table_name,
        'hoodie.datasource.hive_sync.database': 'datalake_raw',
        'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://ip-10-0-62-197.us-west-2.compute.internal:10000'
    }

I tried both the insert and the bulk insert operations and got the same result.
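
A side note on the two `hoodie.combine.before.*` entries in the config above, separate from the sorting question: in Python, `cond if 'true' else 'false'` always evaluates to `cond`, because the non-empty string `'true'` is truthy. Both options therefore end up as Python booleans rather than the strings `'true'` / `'false'`. The sketch below shows what was probably intended. The helper name `combine_flags` is hypothetical, and the sketch assumes the bulk operation is spelled `bulk_insert` (Hudi's documented value for `hoodie.datasource.write.operation`) rather than `bulkinsert` as in the list above.

```python
def combine_flags(write_operation):
    """Return the two combine options as 'true'/'false' strings.

    Hypothetical helper for illustration; not part of Hudi.
    """
    # Assumption: the bulk operation is named 'bulk_insert'.
    insert_ops = ('insert', 'bulk_insert')
    return {
        'hoodie.combine.before.insert':
            'true' if write_operation in insert_ops else 'false',
        'hoodie.combine.before.upsert':
            'true' if write_operation == 'upsert' else 'false',
    }


# The expression as written in the config: the condition after `if` is the
# constant string 'true', so the result is just the membership test itself.
op = 'insert'
as_written = op in ['insert', 'bulkinsert'] if 'true' else 'false'

print(repr(as_written))       # a bool, not a string
print(combine_flags(op))
```

The returned dict can be merged into the options with `hudi_options.update(combine_flags(hudi_config.write_operation))`. Whether the boolean values actually change Hudi's behaviour depends on how the options are passed to the writer, so this may not be related to the file layout seen in the query result.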
