Hi, I am running some Pig queries on Tez on YARN. My Pig query has a large stage that reads 45 GB of data and outputs less than 1 MB. The stage is processed by 200 tasks on a 9-machine cluster, with up to 8 tasks running in parallel per machine, each with 7 GB of memory.
I am monitoring the job's resource usage and I observe that the stage writes 53 GB of data to disk, which confuses me, since the intermediate data size is less than 1 MB. Does anyone have an idea what the reason might be? Is it possible that the processing code in the tasks actually writes data to disk as part of the processing phase? Thank you, Robert (PS: I am looking at iostat counters, namely MB read and MB written.)
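For reference, this is roughly how I aggregate the write counters across a node. It is a minimal sketch, assuming a Linux box with `/proc/diskstats` (the same kernel counters iostat reads); the device filter is an assumption and a rough sum that may double-count partitions alongside their parent disks:

```shell
#!/bin/sh
# Sketch: total MB written since boot, summed from /proc/diskstats.
# Field 10 is "sectors written" (cumulative); the sector unit here is 512 bytes.
# Excluding loop/ram devices is an assumption; partitions (e.g. sda1) may
# still double-count against their whole disk (sda) in this rough sum.
awk '$3 !~ /^(loop|ram)/ { sectors += $10 }
     END { printf "%.1f MB written\n", sectors * 512 / 1024 / 1024 }' /proc/diskstats
```

Sampling this before and after the stage and taking the difference gives the per-stage write volume I quoted above.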