rubenssoto opened a new issue #1902:
URL: https://github.com/apache/hudi/issues/1902


   Hi Guys,
   
   I have a small dataset (7 GB), so I didn't partition the data; I prefer to
   create a few big files, and Hudi created 13 files of about 700 MB each. My
   dataset has an auto-increment id, which is the table's primary key.
   To load the old data I ran a batch operation, and Hudi put one day's data
   into 10 different files.
   Does Hudi do any sorting based on the primary key column, or do I have to
   do an explicit sort operation?
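   A minimal PySpark sketch of the explicit sort I mean, in case that is the
   expected workaround (the `df`, `hudi_options`, and `target_path` names here
   are placeholders, not my full job):

   ```python
   # Hypothetical illustration: sort by the primary key before writing, so that
   # rows with nearby ids end up clustered into the same Parquet files.
   sorted_df = df.sort("id")  # "id" is the auto-increment primary key column

   (sorted_df.write
       .format("hudi")
       .options(**hudi_options)  # the config shown below
       .mode("append")
       .save(target_path))       # target_path: the table's base path
   ```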
   
   For example, I made this query:
   select _hoodie_file_name,count(1) from "order"
   where created_date_brt = '2020-07-01'
   group by _hoodie_file_name
   order by _hoodie_file_name
   
   
   Result:
   <img width="1258" alt="Captura de Tela 2020-08-02 às 19 27 53" 
src="https://user-images.githubusercontent.com/36298331/89133884-57e6ac80-d4f6-11ea-88de-a6a1f9fa80c0.png">
   
   My hudi config:
   hudi_options = {
               'hoodie.table.name': table_name,
               'hoodie.datasource.write.recordkey.field': 
hudi_config.primary_key_column,
               'hoodie.datasource.write.table.name': table_name,
               'hoodie.datasource.write.operation': hudi_config.write_operation,
               'hoodie.combine.before.insert': 'true' if 
hudi_config.write_operation in ['insert', 'bulkinsert'] else 'false',
               'hoodie.combine.before.upsert': 'true' if 
hudi_config.write_operation == 'upsert' else 'false',
               'hoodie.datasource.write.precombine.field': 
hudi_config.precombined_column,
               'hoodie.datasource.write.keygenerator.class': 
'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
               'hoodie.parquet.small.file.limit': 800000000,
               'hoodie.parquet.max.file.size': 900000000,
               'hoodie.parquet.block.size': 800000000,
               'hoodie.copyonwrite.record.size.estimate': 30,
               'hoodie.cleaner.commits.retained': 1,
               'hoodie.datasource.hive_sync.enable': 'true',
               'hoodie.datasource.hive_sync.table': table_name,
               'hoodie.datasource.hive_sync.database': 'datalake_raw',
               'hoodie.datasource.hive_sync.jdbcurl': 
'jdbc:hive2://ip-10-0-62-197.us-west-2.compute.internal:10000'
           }
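   
   One gotcha I hit with the two hoodie.combine.before.* flags: in Python the
   conditional expression has to be written `'true' if cond else 'false'`; the
   reversed form `cond if 'true' else 'false'` always evaluates to `cond`,
   because the non-empty string `'true'` is truthy. A small self-contained
   check of the corrected logic (the `combine_flags` helper is mine, purely
   for illustration):

   ```python
   # Hypothetical helper illustrating the conditional logic for the
   # hoodie.combine.before.* flags; not part of my actual job.
   def combine_flags(write_operation):
       """Return (combine.before.insert, combine.before.upsert) as Hudi string flags."""
       before_insert = 'true' if write_operation in ['insert', 'bulkinsert'] else 'false'
       before_upsert = 'true' if write_operation == 'upsert' else 'false'
       return before_insert, before_upsert

   print(combine_flags('insert'))  # → ('true', 'false')
   print(combine_flags('upsert'))  # → ('false', 'true')
   ```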
   
   I tried with both the insert and bulk insert operations and got the same result.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

