mzheng-plaid commented on issue #9934:
URL: https://github.com/apache/hudi/issues/9934#issuecomment-1789429503

   We tried doubling our driver memory and are hitting the same error. According to 
https://docs.oracle.com/en/java/javase/12/troubleshoot/troubleshoot-memory-leaks.html#GUID-19F6D28E-75A1-4480-9879-D0932B2F305B
 it seems the requested array is larger than the VM's maximum array size, in which case adding more heap would not help.
   
   ```
   "spark.driver.memory": "60g",
   "spark.driver.memoryOverhead": "16g"
   ```
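
   For context on why extra heap doesn't change this: a single JVM array is capped at roughly `Integer.MAX_VALUE` bytes for a `byte[]`, regardless of `-Xmx`. A rough back-of-envelope sketch (the per-stat size and file-slice count below are hypothetical, for illustration only):
   ```python
   # JVM arrays are indexed by int, so a single byte[] tops out just under
   # 2^31 - 1 bytes (~2 GiB), no matter how large the heap is.
   JVM_MAX_ARRAY_BYTES = 2**31 - 1

   # Hypothetical numbers, not measured from our job: if serialized commit
   # metadata averages ~4 KiB per file slice, a commit touching ~1 million
   # file slices needs ~4 GiB in one buffer.
   bytes_per_file_stat = 4 * 1024       # assumed average stat size
   touched_file_slices = 1_000_000      # hypothetical count
   serialized = bytes_per_file_stat * touched_file_slices

   print(serialized > JVM_MAX_ARRAY_BYTES)  # exceeds the per-array limit
   ```
   If the serialized metadata is built in a single buffer, crossing that ~2 GiB line would fail in the same way no matter how much driver memory we add.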
   
   The table is ~70 TB with ~200 billion rows. The average base Parquet file is ~128 MB, and the failed write is ~60 million rows.
   
   We are running an ingestion workload that touches many files in many 
partitions - that is unfortunate but expected. For example, here is a 
`partitionToWriteStats` entry from a successful `deltacommit`:
   ```
   "dt=2021-03-26" : [ {
         "fileId" : "d12d3339-b787-4559-810f-cbafd1c6d7f8-0",
         "path" : 
"dt=2021-03-26/.d12d3339-b787-4559-810f-cbafd1c6d7f8-0_20231020055425697.log.9_0-259-3462231",
         "prevCommit" : "20231020055425697",
         "numWrites" : 1,
         "numDeletes" : 8,
         "numUpdateWrites" : 1,
         "numInserts" : 0,
         "totalWriteBytes" : 23855,
         "totalWriteErrors" : 0,
         "tempPath" : null,
         "partitionPath" : "dt=2021-03-26",
         "totalLogRecords" : 0,
         "totalLogFilesCompacted" : 0,
         "totalLogSizeCompacted" : 0,
         "totalUpdatedRecordsCompacted" : 0,
         "totalLogBlocks" : 0,
         "totalCorruptLogBlock" : 0,
         "totalRollbackBlocks" : 0,
         "fileSizeInBytes" : 23855,
         "minEventTime" : null,
         "maxEventTime" : null,
         "runtimeStats" : {
           "totalScanTime" : 0,
           "totalUpsertTime" : 1576,
           "totalCreateTime" : 0
         },
         "logVersion" : 9,
         "logOffset" : 0,
         "baseFile" : 
"d12d3339-b787-4559-810f-cbafd1c6d7f8-0_106-7549-12992526_20231020055425697.parquet",
         "logFiles" : [ 
".d12d3339-b787-4559-810f-cbafd1c6d7f8-0_20231020055425697.log.8_0-203-2649793",
 
".d12d3339-b787-4559-810f-cbafd1c6d7f8-0_20231020055425697.log.7_0-149-1836792",
 
".d12d3339-b787-4559-810f-cbafd1c6d7f8-0_20231020055425697.log.6_0-101-1549604",
 ".d12d3339-b787-4559-810f-cbafd1c6d7f8-0_20231020055425697.log.5_0-94-938907", 
".d12d3339-b787-4559-810f-cbafd1c6d7f8-0_20231020055425697.log.4_0-39-116221", 
".d12d3339-b787-4559-810f-cbafd1c6d7f8-0_20231020055425697.log.3_0-106-1428581",
 ".d12d3339-b787-4559-810f-cbafd1c6d7f8-0_20231020055425697.log.2_0-99-998265", 
".d12d3339-b787-4559-810f-cbafd1c6d7f8-0_20231020055425697.log.1_0-39-82846", 
".d12d3339-b787-4559-810f-cbafd1c6d7f8-0_20231020055425697.log.9_0-259-3462231" 
],
         "recordsStats" : {
           "val" : null
         }
        } ]
   ```
   
   We've sized our Spark job so that the ingestion itself works fine - it 
seems like Hudi is able to write the data but unable to commit because it's 
trying to allocate an array larger than the JVM allows. We could make our 
ingestion batch size smaller, but that leads to much worse performance because 
of higher write amplification and more overhead in the index lookup step.
   
   We're considering clustering our data from ~128 MB to ~500 MB base files to see if this 
helps.
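
   If we go that route, the inline clustering settings would look roughly like this (a sketch using the standard `hoodie.clustering.*` knobs; the target sizes are values we'd tune, not tested):
   ```
   'hoodie.clustering.inline': 'true',
   'hoodie.clustering.inline.max.commits': 4,
   'hoodie.clustering.plan.strategy.small.file.limit': 268435456,
   'hoodie.clustering.plan.strategy.target.file.max.bytes': 536870912,
   ```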
   
   Current writer configurations:
   ```
    'hoodie.compact.inline': True,
    'hoodie.compact.inline.max.delta.commits': 6,
    'hoodie.cleaner.commits.retained': 1,
    'hoodie.compaction.target.io': 30000000,
    'hoodie.parquet.max.file.size': 536870912,
    'hoodie.parquet.block.size': 536870912,
    'hoodie.parquet.small.file.limit': 268435456,
    'hoodie.bloom.index.prune.by.ranges': 'false',
    'hoodie.upsert.shuffle.parallelism': 9000,
    'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    'hoodie.compaction.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    'hoodie.rollback.parallelism': 500,
    'hoodie.commits.archival.batch': 5
   ```

