Hi,

I think part of the update process is working slowly and I'm doing something
wrong. An example MOR run is at
https://drive.google.com/open?id=17YP_V5k-g3Rp6-jaWaTWvKUSBTOPHg4g
It seems to spend a very long time updating the existing records, and I
don't understand why most of the time goes to
"BoundedInMemoryQueue" (more than 1.5 hours to overwrite 15 GB and 625 files).
I use Spark version 2.3.0.cloudera3 and am trying to apply 100 million
records (15 GB) to a snapshot of 1 billion records (1 TB, ~8k files).
I would really appreciate it if anyone could help me locate this problem.
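For reference, the write is essentially the following (a simplified sketch,
not my exact job; field names, paths, and the table name are placeholders):

```scala
// Simplified sketch of the upsert call (placeholders, not the real job).
// "com.uber.hoodie" was the Hudi datasource name at the time of writing.
df.write
  .format("com.uber.hoodie")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "id") // placeholder key field
  .option("hoodie.datasource.write.precombine.field", "ts") // placeholder ordering field
  .option("hoodie.table.name", "init_1000mln")
  .mode(SaveMode.Append)
  .save("/init_1000mln")
```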

Executor logs:

19/05/29 12:47:09 INFO storage.ShuffleBlockFetcherIterator: Started
269 remote fetches in 109 ms

19/05/29 12:47:09 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=mradionov (auth:SIMPLE)]]]

19/05/29 12:47:09 INFO io.HoodieMergeHandle: MaxMemoryPerPartitionMerge => 0

19/05/29 12:47:09 INFO collection.DiskBasedMap: Spilling to file
location /tmp/91e5578f-6e25-476d-8c63-15834c7588f9 in host (1.1.1.1)
with hostname (host)

19/05/29 12:47:23 INFO io.HoodieMergeHandle: Number of entries in
MemoryBasedMap => 0, Total size in bytes of MemoryBasedMap => 0, Number of
entries in DiskBasedMap => 125849, Size of file spilled to disk =>
75107012
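
Reading these two lines together, MaxMemoryPerPartitionMerge => 0 seems to
mean that no memory is budgeted for the merge map, so all 125849 entries
spill to the DiskBasedMap. Is this budget controlled by the options below?
(These option names are my best guess from the Hudi memory config; I have
not set them explicitly, and the values here are only illustrative.)

```scala
// My assumption: merge-map memory budget options (I have NOT set these;
// values below are illustrative, not recommendations)
df.write
  .format("com.uber.hoodie")
  .option("hoodie.memory.merge.fraction", "0.6")               // fraction of available memory for merging
  .option("hoodie.memory.merge.max.size", (1L << 30).toString) // hard cap on merge-map memory, in bytes
```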

19/05/29 12:47:25 INFO io.HoodieMergeHandle: Merging new data into
oldPath 
/init_1000mln/default/a46c60f1-63bf-4b8d-b24e-6a6bdb36dad9_2593_20190527161512.parquet,
as newPath 
/init_1000mln/default/a46c60f1-63bf-4b8d-b24e-6a6bdb36dad9_461_20190529115632.parquet

19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]

19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]

19/05/29 12:47:25 INFO compress.CodecPool: Got brand-new compressor [.gz]

19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]

19/05/29 12:47:25 INFO util.FSUtils: Hadoop Configuration:
fs.defaultFS: [hdfs://nameservice1], Config:[Configuration: ],
FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1394640145_127,
ugi=(auth:SIMPLE)]]]

19/05/29 12:47:25 INFO queue.BoundedInMemoryExecutor: starting consumer thread

19/05/29 *12:47:25* INFO queue.IteratorBasedQueueProducer: starting to
buffer records

19/05/29 *13:28:44* INFO queue.IteratorBasedQueueProducer: finished
buffering records

19/05/29 13:28:44 INFO queue.BoundedInMemoryExecutor: Queue
Consumption is done; notifying producer threads



Best Regards
Maksim Radionov
