abhijeetkushe commented on issue #1737:
URL: https://github.com/apache/hudi/issues/1737#issuecomment-696926015
@n3nash Apologies for the delayed response. I tried a bunch of heuristics from the available config options for both COW and MOR tables, and I think I got an idea of how the file creation happens. I am using emr-5.30.1, which ships Hudi 0.5.2-incubating and Presto 0.232.
I did observe a few things and have a few questions about them.
For the COW table, I am writing 100 MB of data multiple times using the options below:
```python
{
    'hoodie.table.name': 'click',
    'hoodie.datasource.write.recordkey.field': 'campaign_activity_id,contact_id,created_on',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.field': 'bucket',
    'hoodie.datasource.write.hive_style_partitioning': True,
    'hoodie.datasource.write.table.name': 'click',
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.precombine.field': 'created_on',
    'hoodie.parquet.small.file.limit': 25000000,     # 25 MB (0.5 million x 50 bytes = 25 MB)
    'hoodie.copyonwrite.insert.split.size': 500000,  # 0.5 million records
    'hoodie.copyonwrite.record.size.estimate': 50,   # 50 bytes per record
    'hoodie.parquet.max.file.size': 50000000,        # 50 MB
    'hoodie.parquet.block.size': 50000000,
    'hoodie.copyonwrite.insert.auto.split': False,
    # 'hoodie.embed.timeline.server': False,
    'hoodie.clean.automatic': True,
    'hoodie.clean.async': False,
    'hoodie.cleaner.commits.retained': 1,
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.datasource.hive_sync.partition_fields': 'bucket',
    'hoodie.datasource.hive_sync.enable': True,
    'hoodie.datasource.hive_sync.table': 'click',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}
```
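For reference, this is the arithmetic behind the sizing values I picked (a sketch only; the byte counts come from my config above, and the 50 bytes/record is the configured estimate, not a measured record size):

```python
# Sanity-check the relationship between the sizing configs above.
record_size_estimate = 50      # hoodie.copyonwrite.record.size.estimate (bytes)
insert_split_size = 500_000    # hoodie.copyonwrite.insert.split.size (records)
small_file_limit = 25_000_000  # hoodie.parquet.small.file.limit (bytes)
max_file_size = 50_000_000     # hoodie.parquet.max.file.size (bytes)

# One insert split should produce roughly 25 MB of data,
# which is exactly the small-file limit ...
split_bytes = insert_split_size * record_size_estimate
assert split_bytes == small_file_limit  # 500,000 x 50 = 25,000,000

# ... so two splits should fill one file up to the 50 MB cap.
records_per_full_file = max_file_size // record_size_estimate
print(records_per_full_file)  # 1,000,000 records per 50 MB file
```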
I ran into the Invalid Parquet issue after the 3rd write (https://github.com/prestodb/presto/issues/13457), which will be fixed in a later version of Presto. But I also noticed that files were being created larger than 50 MB, which exceeds the hoodie.parquet.max.file.size specified above (snapshots below). I noticed the same behavior for MOR, where I believe the cause ought to be the same since the base files are also Parquet.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]