santas-little-helper-13 opened a new issue #2372:
URL: https://github.com/apache/hudi/issues/2372


   Hi,
   
   I'm working with Hudi in an AWS Glue job. I have been testing whether it 
works with large files (more than 1 GB). I copied one JSON file multiple 
times to create larger datasets in the bucket.
   
   1 GB, 3 GB, and 6 GB files run successfully in the Glue job, but 12 GB fails 
after 2.5 h with this error:
   
   ```
   Failed to upsert for commit time 20201221141952;
   Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
   ShuffleMapStage 6 (countByKey at HoodieBloomIndex.java:141) has failed the
   maximum allowable number of times: 4. Most recent failure reason:
   org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
   location for shuffle 2
   ```
   
   These are the Hudi options:
   
   ```python
   hoodie_write_options = {
        'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
        'hoodie.parquet.compression.codec': 'snappy',
        'hoodie.table.name': table_name,
        'hoodie.datasource.write.recordkey.field': 'line_no',
        'hoodie.datasource.write.table.name': table_name,
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.precombine.field': 'metadata_key',
        'hoodie.datasource.write.insert.drop.duplicates': False,
        'hoodie.upsert.shuffle.parallelism': 8,
        'hoodie.insert.shuffle.parallelism': 8,
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
   }
   ```
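   For context, this is roughly how such an options dict is built and handed to the DataFrame writer in a Glue job. This is only a sketch: `build_hudi_options`, `df`, and the S3 target path are my own illustrative names, not taken from the actual job; the shuffle parallelism is exposed as a parameter purely to make experimenting with it easier.
   
   ```python
   def build_hudi_options(table_name, shuffle_parallelism=8):
       """Build the Hudi write options from the issue. All values are passed
       as strings, which Spark's DataFrameWriter.options() accepts."""
       return {
           'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
           'hoodie.parquet.compression.codec': 'snappy',
           'hoodie.table.name': table_name,
           'hoodie.datasource.write.recordkey.field': 'line_no',
           'hoodie.datasource.write.table.name': table_name,
           'hoodie.datasource.write.operation': 'upsert',
           'hoodie.datasource.write.precombine.field': 'metadata_key',
           'hoodie.datasource.write.insert.drop.duplicates': 'false',
           # parallelism parameterized here for tuning experiments (assumption)
           'hoodie.upsert.shuffle.parallelism': str(shuffle_parallelism),
           'hoodie.insert.shuffle.parallelism': str(shuffle_parallelism),
           'hoodie.datasource.write.keygenerator.class':
               'org.apache.hudi.keygen.ComplexKeyGenerator',
       }
   
   # Hypothetical usage inside the Glue job (df and the path are assumptions):
   # opts = build_hudi_options('my_table', shuffle_parallelism=8)
   # df.write.format('hudi').options(**opts).mode('append') \
   #     .save('s3://my-bucket/hudi/my_table')
   ```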
   
   I have tried adding `'hoodie.clean.automatic': False`, but it didn't help.
   
   I also tried increasing the DPU count from 10 to 50, and then to 200; that didn't help either.
   
   These are the execution times for every size I have tried to run:
   
   Size | Execution time
   ---: | ---:
   1 GB | 3 min
   3 GB | 12 min
   6 GB | 26 min
   12 GB | failed after 2.5 h
   
   Does anybody know what I can do to run this successfully?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

