vinothchandar commented on issue #1491: [SUPPORT] OutOfMemoryError during 
upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610504319
 
 
   Is this real data, or can you share a reproducible snippet of code? 
Especially with these local microbenchmarks, it's useful to understand the 
setup, as small costs that typically don't matter on a real cluster tend to 
get amplified.. 
   
   From the logs, it seems like 
   1) bulk_insert is succeeding and upsert is what's failing... and it's 
failing during the write phase, when we actually allocate some memory to do the 
merge.. 
   
   
   2) From the logs below, it seems like you potentially have a lot of data for 
a single node.. How much total data do you have in those 53M records? (That's a 
key metric for runtime, more so than the number of records. Hudi does not have 
a maximum record limit per se.)
   
   ```
   20/04/07 08:02:55 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory 
map of 1325.4 MB to disk (1 time so far)
   20/04/07 08:03:04 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory 
map of 1329.9 MB to disk (1 time so far)
   20/04/07 08:03:04 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory 
map of 1325.7 MB to disk (1 time so far)
   20/04/07 08:03:07 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory 
map of 1385.6 MB to disk (1 time so far)
   20/04/07 08:03:25 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory 
map of 1325.4 MB to disk (2 times so far)
   20/04/07 08:03:41 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory 
map of 1325.5 MB to disk (2 times so far)
   20/04/07 08:03:43 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory 
map of 1325.4 MB to disk (2 times so far)
   20/04/07 08:03:58 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory 
map of 1381.4 MB to disk (2 times so far)
   20/04/07 08:04:08 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory 
map of 1325.4 MB to disk (3 times so far)
   20/04/07 08:04:24 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory 
map of 1325.4 MB to disk (3 times so far)
   20/04/07 08:04:28 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory 
map of 1327.7 MB to disk (3 times so far)
   20/04/07 08:04:57 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory 
map of 1325.4 MB to disk (4 times so far)
   20/04/07 08:04:59 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory 
map of 1491.8 MB to disk (3 times so far)
   20/04/07 08:05:14 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory 
map of 1363.9 MB to disk (4 times so far)
   20/04/07 08:05:16 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory 
map of 1325.4 MB to disk (4 times so far)
   20/04/07 08:05:47 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory 
map of 1349.8 MB to disk (4 times so far)
   20/04/07 08:06:05 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory 
map of 1300.9 MB to disk (5 times so far)
   ```
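
To get a rough sense of how much each thread is spilling, the messages above can be tallied with a short script. This is only a sketch: the regular expression is written against the message format shown in this log, the function name `summarize_spills` is made up for illustration, and `SAMPLE` is just three of the lines above.

```python
import re
from collections import defaultdict

# Matches one spill message once the wrapped lines have been re-joined, e.g.
# "Thread 136 spilling in-memory map of 1325.4 MB to disk (1 time so far)"
SPILL_RE = re.compile(
    r"Thread (\d+) spilling in-memory map of ([\d.]+) MB to disk "
    r"\((\d+) times? so far\)"
)


def summarize_spills(log_text):
    """Return {thread_id: (spill_count, total_mb)} for the spill messages found."""
    # The mail archive wraps each message across two lines, so collapse all
    # whitespace into single spaces before matching.
    flat = " ".join(log_text.split())
    totals = defaultdict(lambda: [0, 0.0])
    for m in SPILL_RE.finditer(flat):
        thread, mb = int(m.group(1)), float(m.group(2))
        totals[thread][0] += 1
        totals[thread][1] += mb
    return {t: (n, round(total, 1)) for t, (n, total) in totals.items()}


# Three messages copied from the log above, wrapped the same way.
SAMPLE = """\
20/04/07 08:02:55 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory
map of 1325.4 MB to disk (1 time so far)
20/04/07 08:03:07 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory
map of 1385.6 MB to disk (1 time so far)
20/04/07 08:03:25 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory
map of 1325.4 MB to disk (2 times so far)
"""

if __name__ == "__main__":
    for thread, (count, mb) in sorted(summarize_spills(SAMPLE).items()):
        print(f"thread {thread}: {count} spills, {mb} MB total")
```

Each spill writes out the whole in-memory map, so the per-thread total is only a lower bound on the data that thread handled, not the size of the dataset.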
   
   I suspect what's happening is that Spark memory is actually full (Hudi 
caches the input to derive the workload profile etc., and it is typically 
advised to keep input data in memory), so it keeps spilling to disk, slowing 
everything down.. (more of a Spark tuning thing)... But things don't break 
until Hudi tries to allocate some memory on its own, at which point the heap 
is full.. 
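
For a rough picture of why the heap can fill up this way, here is a back-of-the-envelope sketch of how Spark's unified memory manager divides an executor heap. It assumes the documented defaults for `spark.memory.fraction` (0.6) and `spark.memory.storageFraction` (0.5) and the fixed 300 MB reservation; the 4300 MB heap is a hypothetical number, not something taken from this issue, and `unified_memory_mb` is a made-up helper name.

```python
RESERVED_MB = 300  # fixed amount Spark sets aside before applying the fractions


def unified_memory_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate split of a JVM heap under Spark's unified memory manager.

    memory_fraction  ~ spark.memory.fraction
    storage_fraction ~ spark.memory.storageFraction
    """
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction      # shared by execution and storage
    storage = unified * storage_fraction    # cached blocks protected from eviction
    return {
        "unified": unified,
        "storage": storage,
        "execution": unified - storage,     # shuffles, joins, aggregation maps
        "other": usable - unified,          # user objects outside Spark's accounting
    }


if __name__ == "__main__":
    # Hypothetical 4300 MB heap, chosen only to make the arithmetic easy to follow.
    for region, mb in unified_memory_mb(4300).items():
        print(f"{region:>9}: {mb:7.1f} MB")
```

The boundary between storage and execution is soft, so cached input can grow into the execution share and push aggregation maps to spill. Allocations made by user code, which is presumably where Hudi's merge buffers fall, have to fit in whatever is left outside the unified region.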
   
   Can you give this a shot on a cluster?
   
   
   

