Hi all, I want to ask how to estimate an RDD's size (in bytes) when it is not saved to disk, because the job takes a long time if the output is very large and the number of output partitions is small.
The following steps are how I try to solve this problem:

1. Sample 1% (fraction 0.01) of the original data.
2. Compute the sample data count.
3. If the sample count > 0, cache the sample data and compute the sample data size.
4. Compute the original RDD's total count.
5. Estimate the RDD size as ${total count} * ${sample data size} / ${sample count}.

The code is here.

My questions:

1. Can I use the above approach to solve the problem? If not, where is it wrong?
2. Is there any existing solution (an existing API in Spark) for this problem?

Best Regards,
Kelly Zhang
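To make step 5 concrete, here is a minimal sketch of just the extrapolation arithmetic in plain Python (no Spark involved; the function name `estimate_rdd_size` is my own, hypothetical):

```python
def estimate_rdd_size(total_count, sample_count, sample_bytes):
    """Extrapolate the full RDD size from a cached sample.

    total_count  -- record count of the full RDD (step 4)
    sample_count -- record count of the sample (step 2)
    sample_bytes -- measured in-memory size of the cached sample (step 3)
    """
    if sample_count == 0:
        # Step 3 guards against this: an empty sample gives no information.
        raise ValueError("empty sample: cannot extrapolate")
    # Step 5: scale the sample size by the ratio of total to sample count.
    return total_count * sample_bytes / sample_count

# e.g. 1,000,000 records total, 10,000 sampled occupying 5 MiB
print(estimate_rdd_size(1_000_000, 10_000, 5 * 1024 * 1024))
```

Note this assumes records in the sample have roughly the same average byte size as the full data set, which is the implicit assumption behind the sampling approach.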