Hi all,

I have a simple processing job for 20,000 accounts on 8 partitions, so
roughly 2,500 accounts per partition. Each account takes about 1 s to
compute, which means each partition will take about 2,500 seconds to
finish its batch.

My question is: how can I get detailed progress on how many accounts have
been processed in each partition during the computation? An ideal solution
would let me know periodically (say, every minute) how many accounts have
been processed, so I can monitor the job and take action to save time.
Right now the UI only tells me that the task is running.

I know one solution is to split the data horizontally on the driver and
submit it to Spark in mini-batches, but I think that would waste cluster
resources and add extra complexity to result handling.

Any experience or best practice is welcome. Thanks a lot.

Regards,
Yuhao
