Hi, I think part of the update process is working slowly and that I'm doing something wrong. Example MOR job: https://drive.google.com/open?id=17YP_V5k-g3Rp6-jaWaTWvKUSBTOPHg4g

It seems to take a very long time to update the existing records, and I don't understand why most of the time is spent in "BoundedInMemoryQueue" (more than 1.5 hours to overwrite 15 GB across 625 files). I use Spark version 2.3.0.cloudera3 and am trying to apply 100 million records (15 GB) of updates to a snapshot of 1 billion records (1 TB, ~8k files). I would really appreciate it if anyone could help me locate this problem.
Logs, executor:

19/05/29 12:47:09 INFO storage.ShuffleBlockFetcherIterator: Started 269 remote fetches in 109 ms
19/05/29 12:47:09 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127, ugi=mradionov (auth:SIMPLE)]]]
19/05/29 12:47:09 INFO io.HoodieMergeHandle: MaxMemoryPerPartitionMerge => 0
19/05/29 12:47:09 INFO collection.DiskBasedMap: Spilling to file location /tmp/91e5578f-6e25-476d-8c63-15834c7588f9 in host (1.1.1.1) with hostname (host)
19/05/29 12:47:23 INFO io.HoodieMergeHandle: Number of entries in MemoryBasedMap => 0, Total size in bytes of MemoryBasedMap => 0, Number of entries in DiskBasedMap => 125849, Size of file spilled to disk => 75107012
19/05/29 12:47:25 INFO io.HoodieMergeHandle: Merging new data into oldPath /init_1000mln/default/a46c60f1-63bf-4b8d-b24e-6a6bdb36dad9_2593_20190527161512.parquet, as newPath /init_1000mln/default/a46c60f1-63bf-4b8d-b24e-6a6bdb36dad9_461_20190529115632.parquet
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127, ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127, ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO compress.CodecPool: Got brand-new compressor [.gz]
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127, ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127, ugi=(auth:SIMPLE)]]]
19/05/29 12:47:25 INFO queue.BoundedInMemoryExecutor: starting consumer thread
19/05/29 *12:47:25* INFO queue.IteratorBasedQueueProducer: starting to buffer records
19/05/29 *13:28:44* INFO queue.IteratorBasedQueueProducer: finished buffering records
19/05/29 13:28:44 INFO queue.BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads

Best regards,
Maksim Radionov
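P.S. One thing I notice in the log is `MaxMemoryPerPartitionMerge => 0` together with everything landing in the DiskBasedMap (`MemoryBasedMap => 0`), so it looks like the merge gets no in-memory budget at all and every record spills to /tmp. A sketch of what I could try, assuming `hoodie.memory.merge.max.size` and `hoodie.memory.merge.fraction` are the relevant knobs (the format string, table name, and path here are placeholders for my actual job, please correct me if these config keys are wrong):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical sketch: give the merge-handle's spillable map an explicit
// in-memory budget instead of the 0 bytes reported in the log, so fewer
// records spill to the DiskBasedMap during BoundedInMemoryQueue consumption.
def upsertWithMergeMemory(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("org.apache.hudi") // or "com.uber.hoodie", depending on the Hudi version
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.table.name", "init_1000mln")                        // placeholder
    // assumed knobs: absolute cap and fraction of executor memory for the merge map
    .option("hoodie.memory.merge.max.size", (1024L * 1024 * 1024).toString) // ~1 GB
    .option("hoodie.memory.merge.fraction", "0.6")
    .mode(SaveMode.Append)
    .save(basePath)
}
```

If the merge map fits (even partly) in memory, I would expect the producer-side buffering between 12:47:25 and 13:28:44 to shrink, since the consumer no longer serializes every record to disk.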
