Hi Maksim,
It looks like the Spark and Hudi memory settings (spark.executor.memory,
spark.memory.fraction, hoodie.memory.merge.fraction) may not be
configured in a way that leaves Hudi any memory for merging. With your
current settings, Hudi has no memory available for the merge process and
is falling back to disk-based merging, which is slow but makes progress
without OOMing. You would need to check your configs.
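To make the dependency concrete, here is an illustrative sketch of how
those three settings could be passed on the write path. The property
names are the ones discussed in this thread; every value is a
placeholder to tune for your cluster, and the commented write options
are assumptions about your job, not taken from your logs:

    import org.apache.spark.sql.SparkSession;

    // Illustrative only: values below are placeholders, not recommendations.
    public class HudiMergeMemorySettings {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hudi-merge-memory-sketch")
            // Total JVM heap per executor.
            .config("spark.executor.memory", "8g")
            // Fraction of the heap Spark keeps for its own
            // execution/storage regions; Hudi's merge budget is
            // carved out of the remainder.
            .config("spark.memory.fraction", "0.6")
            .getOrCreate();

        // The merge fraction is a per-write Hudi option, e.g.:
        // df.write().format("com.uber.hoodie")
        //     .option("hoodie.memory.merge.fraction", "0.6")
        //     ...
      }
    }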
The logic for calculating the merge memory budget is in:
https://github.com/apache/incubator-hudi/blob/acd74129cd97f24c0dde9bf032a4048f2ce27b5f/hoodie-client/src/main/java/com/uber/hoodie/config/HoodieMemoryConfig.java#L117
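In rough terms, the budget computed there boils down to the sketch
below. This is a minimal paraphrase of that file's calculation, not the
actual source; the method and class names here are made up for
illustration:

    // Sketch of the budget derived in HoodieMemoryConfig (see link above).
    public class MergeMemoryBudget {
      static long maxMemoryForMerge(long executorMemoryBytes,
                                    double sparkMemoryFraction,
                                    double hoodieMergeFraction) {
        // Heap Spark leaves free, scaled by the fraction Hudi may use.
        return (long) Math.floor(
            (1.0 - sparkMemoryFraction) * executorMemoryBytes * hoodieMergeFraction);
      }

      public static void main(String[] args) {
        long eightGiB = 8L * 1024 * 1024 * 1024;
        // With these illustrative inputs, ~1.9 GiB is available for merging.
        System.out.println(maxMemoryForMerge(eightGiB, 0.6, 0.6));
      }
    }

A MaxMemoryPerPartitionMerge of 0, as in your executor log below, means
this budget resolved to zero, so every record spills to the DiskBasedMap.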
Balaji.V
On Friday, May 31, 2019, 2:53:56 AM PDT, Максим Радионов
<[email protected]> wrote:
Hi,
I think part of the upsert process is running slowly and that I'm doing
something wrong. An example MOR run is at
https://drive.google.com/open?id=17YP_V5k-g3Rp6-jaWaTWvKUSBTOPHg4g
It seems to spend a long time updating the existing records, and I don't
understand why most of the time goes into "BoundedInMemoryQueue" (more
than 1.5 hours to overwrite 15 GB across 625 files).
I use Spark version 2.3.0.cloudera3 and am trying to apply 100 million
records (15 GB) to a snapshot of 1 billion records (1 TB, ~8k files).
I would appreciate it if anyone could help me locate this problem.
Executor logs:
19/05/29 12:47:09 INFO storage.ShuffleBlockFetcherIterator: Started
269 remote fetches in 109 ms
19/05/29 12:47:09 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=mradionov (auth:SIMPLE)]]]
19/05/29 12:47:09 INFO io.HoodieMergeHandle: MaxMemoryPerPartitionMerge => 0
19/05/29 12:47:09 INFO collection.DiskBasedMap: Spilling to file
location /tmp/91e5578f-6e25-476d-8c63-15834c7588f9 in host (1.1.1.1)
with hostname (host)
19/05/29 12:47:23 INFO io.HoodieMergeHandle: Number of entries in
MemoryBasedMap => 0, Total size in bytes of MemoryBasedMap => 0, Number
of entries in DiskBasedMap => 125849, Size of file spilled to disk =>
75107012
19/05/29 12:47:25 INFO io.HoodieMergeHandle: Merging new data into
oldPath
/init_1000mln/default/a46c60f1-63bf-4b8d-b24e-6a6bdb36dad9_2593_20190527161512.parquet,
as newPath
/init_1000mln/default/a46c60f1-63bf-4b8d-b24e-6a6bdb36dad9_461_20190529115632.parquet
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO compress.CodecPool: Got brand-new compressor [.gz]
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO queue.BoundedInMemoryExecutor: starting consumer thread
19/05/29 *12:47:25* INFO queue.IteratorBasedQueueProducer: starting to
buffer records
19/05/29 *13:28:44* INFO queue.IteratorBasedQueueProducer: finished
buffering records
19/05/29 13:28:44 INFO queue.BoundedInMemoryExecutor: Queue
Consumption is done; notifying producer threads
Best Regards
Maksim Radionov