santas-little-helper-13 opened a new issue #2372:
URL: https://github.com/apache/hudi/issues/2372
Hi,
I'm working with Hudi in an AWS Glue job. I have been testing it to see whether it
works with large files (more than 1 GB). I copied one JSON file multiple
times to produce larger datasets in the bucket.
The 1 GB, 3 GB, and 6 GB runs succeed in the Glue job, but the 12 GB run fails
after 2.5 h with this error:
```
Failed to upsert for commit time 20201221141952;
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
ShuffleMapStage 6 (countByKey at HoodieBloomIndex.java:141) has failed the
maximum allowable number of times: 4. Most recent failure reason:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 2
```
These are Hudi options:
```
hoodie_write_options = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.parquet.compression.codec': 'snappy',
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'line_no',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'metadata_key',
    'hoodie.datasource.write.insert.drop.duplicates': False,
    'hoodie.upsert.shuffle.parallelism': 8,
    'hoodie.insert.shuffle.parallelism': 8,
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
}
```
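For reference, this is roughly how the options are passed to the Hudi datasource writer in the job (the DataFrame `df` and the S3 path here are placeholders, not the real ones):

```python
# Usage sketch: passing the options dict above to the Hudi writer from PySpark.
# `df` is the input DataFrame; the bucket/path is a placeholder.
(
    df.write.format("hudi")
    .options(**hoodie_write_options)
    .mode("append")
    .save("s3://example-bucket/hudi/" + table_name)
)
```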
I have tried adding `'hoodie.clean.automatic': False`, but it didn't help.
I also tried increasing the Glue DPU count from 10 to 50 and then to 200; that didn't help either.
These are the execution times for every size that I have tried to run:
Size | Execution time
---: | ---:
1 GB | 3 min
3 GB | 12 min
6 GB | 26 min
12 GB | fails after 2.5 h
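One guess: with `hoodie.upsert.shuffle.parallelism` fixed at 8, each shuffle partition of the 12 GB input handles about 1.5 GB, which seems like a lot. A rough sketch of sizing the parallelism from the input size instead (the 500 MB-per-partition target is just my assumption, not an official Hudi recommendation):

```python
import math

def suggest_shuffle_parallelism(input_bytes: int,
                                target_partition_bytes: int = 500 * 1024**2,
                                minimum: int = 8) -> int:
    """Rough heuristic: use enough shuffle partitions that each one sees
    at most ~target_partition_bytes of input (500 MB is an assumed target)."""
    return max(minimum, math.ceil(input_bytes / target_partition_bytes))

GB = 1024**3
print(suggest_shuffle_parallelism(12 * GB))  # 25, vs. the fixed 8 used above
```

Would bumping the parallelism like this be the right direction, or is something else going on?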
Does anybody know what I can do to run this successfully?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]