Hi all, I have a simple processing job covering 20,000 accounts on 8 partitions, roughly 2,500 accounts per partition. Each account takes about 1 s of computation, so each partition needs about 2,500 seconds (~42 minutes) to finish its batch.
My question is: how can I get detailed progress on how many accounts have been processed in each partition while the computation is running? Ideally I would see the processed count periodically (say, every minute) so I can monitor the job and intervene early to save time. Right now the UI only tells me that the task is running.

One solution I know of is to split the data horizontally on the driver and submit it to Spark in mini-batches, but I think that would waste some cluster resources and add extra complexity to result handling. Any experience or best practice is welcome.

Thanks a lot.

Regards,
Yuhao