Hi Maksim,
It looks like the Spark and Hudi memory settings (spark.executor.memory,
spark.memory.fraction, hoodie.memory.merge.fraction) may not be
configured in a way that leaves Hudi any memory for merging. With your
current settings, Hudi has no memory available for the merge process and
is falling back to disk-based merging, which is slow but makes progress
without OOMing. You would need to check your configs.
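To make the dependency concrete, here is an illustrative sketch of how
those three settings could be passed on the write path. The property
names are the ones discussed in this thread; every value is a
placeholder to tune for your cluster, and the commented write options
are assumptions about your job, not taken from your logs:

    import org.apache.spark.sql.SparkSession;

    // Illustrative only: values below are placeholders, not recommendations.
    public class HudiMergeMemorySettings {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hudi-merge-memory-sketch")
            // Total JVM heap per executor.
            .config("spark.executor.memory", "8g")
            // Fraction of the heap Spark keeps for its own
            // execution/storage regions; Hudi's merge budget is
            // carved out of the remainder.
            .config("spark.memory.fraction", "0.6")
            .getOrCreate();

        // The merge fraction is a per-write Hudi option, e.g.:
        // df.write().format("com.uber.hoodie")
        //     .option("hoodie.memory.merge.fraction", "0.6")
        //     ...
      }
    }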
The logic for calculating the merge memory budget is in:
https://github.com/apache/incubator-hudi/blob/acd74129cd97f24c0dde9bf032a4048f2ce27b5f/hoodie-client/src/main/java/com/uber/hoodie/config/HoodieMemoryConfig.java#L117
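In rough terms, the budget computed there boils down to the sketch
below. This is a minimal paraphrase of that file's calculation, not the
actual source; the method and class names here are made up for
illustration:

    // Sketch of the budget derived in HoodieMemoryConfig (see link above).
    public class MergeMemoryBudget {
      static long maxMemoryForMerge(long executorMemoryBytes,
                                    double sparkMemoryFraction,
                                    double hoodieMergeFraction) {
        // Heap Spark leaves free, scaled by the fraction Hudi may use.
        return (long) Math.floor(
            (1.0 - sparkMemoryFraction) * executorMemoryBytes * hoodieMergeFraction);
      }

      public static void main(String[] args) {
        long eightGiB = 8L * 1024 * 1024 * 1024;
        // With these illustrative inputs, ~1.9 GiB is available for merging.
        System.out.println(maxMemoryForMerge(eightGiB, 0.6, 0.6));
      }
    }

A MaxMemoryPerPartitionMerge of 0, as in your executor log below, means
this budget resolved to zero, so every record spills to the DiskBasedMap.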
Balaji.V
On Friday, May 31, 2019, 2:53:56 AM PDT, Максим Радионов
<[email protected]> wrote:
Hi,
I think part of the upsert process is running slowly and that I'm doing
something wrong. An example MOR run is at
https://drive.google.com/open?id=17YP_V5k-g3Rp6-jaWaTWvKUSBTOPHg4g
It seems to spend a long time updating the existing records, and I don't
understand why most of the time goes into "BoundedInMemoryQueue" (more
than 1.5 hours to overwrite 15 GB across 625 files).
I use Spark version 2.3.0.cloudera3 and am trying to apply 100 million
records (15 GB) to a snapshot of 1 billion records (1 TB, ~8k files).
I would appreciate it if anyone could help me locate this problem.
Executor logs:
19/05/29 12:47:09 INFO storage.ShuffleBlockFetcherIterator: Started
269 remote fetches in 109 ms
19/05/29 12:47:09 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=mradionov (auth:SIMPLE)]]]
19/05/29 12:47:09 INFO io.HoodieMergeHandle: MaxMemoryPerPartitionMerge => 0
19/05/29 12:47:09 INFO collection.DiskBasedMap: Spilling to file
location /tmp/91e5578f-6e25-476d-8c63-15834c7588f9 in host (1.1.1.1)
with hostname (host)
19/05/29 12:47:23 INFO io.HoodieMergeHandle: Number of entries in
MemoryBasedMap => 0, Total size in bytes of MemoryBasedMap => 0, Number
of entries in DiskBasedMap => 125849, Size of file spilled to disk =>
75107012
19/05/29 12:47:25 INFO io.HoodieMergeHandle: Merging new data into
oldPath
/init_1000mln/default/a46c60f1-63bf-4b8d-b24e-6a6bdb36dad9_2593_20190527161512.parquet,
as newPath
/init_1000mln/default/a46c60f1-63bf-4b8d-b24e-6a6bdb36dad9_461_20190529115632.parquet
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO compress.CodecPool: Got brand-new compressor [.gz]
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO queue.BoundedInMemoryExecutor: starting consumer thread
19/05/29 *12:47:25* INFO queue.IteratorBasedQueueProducer: starting to
buffer records
19/05/29 *13:28:44* INFO queue.IteratorBasedQueueProducer: finished
buffering records
19/05/29 13:28:44 INFO queue.BoundedInMemoryExecutor: Queue
Consumption is done; notifying producer threads
Best Regards
Maksim Radionov