Hi,
I am running some Pig queries on Tez on YARN. My Pig query has a large 
stage which reads 45 GB of data and outputs less than 1 MB. The stage is 
processed by 200 tasks on a 9-machine cluster, with up to 8 tasks running in 
parallel, each with 7 GB of memory.
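
For context, the stage is shaped roughly like the sketch below (the relation
and path names are made up, not my actual script): a full scan of the large
input followed by a group-and-aggregate that collapses it to a tiny result.

    -- Hypothetical sketch of the stage shape: a ~45 GB scan
    -- that aggregates down to a result well under 1 MB.
    events  = LOAD '/data/events' USING PigStorage('\t')
              AS (user:chararray, ts:long, bytes:long);
    grouped = GROUP events BY user;
    counts  = FOREACH grouped GENERATE group,
              COUNT(events) AS n, SUM(events.bytes) AS total_bytes;
    STORE counts INTO '/out/counts';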

I am monitoring the job's resource usage and I observe that the stage 
writes 53 GB of data to disk, which confuses me, since the intermediate 
data size is less than 1 MB.

Do you have any idea what might be the reason? Is it possible that the 
processing code in the tasks actually writes data to disk as part of the 
processing phase?
Thank you,
Robert

(PS: I am looking at iostat counters, namely MB read and MB written.)
