Hi all:
I want to ask how to estimate the size of an RDD (in bytes) when it is not
saved to disk, because the job takes a long time when the output is very
large and the output partition number is small.

The following steps are how I currently try to solve the problem:

 1. Sample a 0.01 fraction of the original data.

 2. Compute the sample data count.

 3. If the sample data count > 0, cache the sample data and compute the
    sample data size.

 4. Compute the original RDD's total count.

 5. Estimate the RDD size as ${total count} * ${sample data size} / ${sample
    data count}.

The code is here.  
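In outline, the code follows this sketch (the helper name
estimateRddSizeBytes is made up here, and it assumes Spark's
org.apache.spark.util.SizeEstimator for sizing each sampled record):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.util.SizeEstimator

  // Sketch of the five steps above; returns None if the sample is empty.
  def estimateRddSizeBytes[T <: AnyRef](rdd: RDD[T],
                                        fraction: Double = 0.01): Option[Long] = {
    // 1. Sample the original data (without replacement).
    val sample = rdd.sample(withReplacement = false, fraction)
    // 3. Cache the sample so the two passes below don't recompute it.
    sample.persist(StorageLevel.MEMORY_ONLY)

    // 2. Compute the sample data count.
    val sampleCount = sample.count()
    val estimate =
      if (sampleCount == 0) {
        None
      } else {
        // 3. Sum per-record in-memory size estimates over the sample.
        val sampleBytes = sample.map(SizeEstimator.estimate(_)).sum()
        // 4. Compute the original RDD's total count.
        val totalCount = rdd.count()
        // 5. Scale up: total count * sample size / sample count.
        Some((sampleBytes / sampleCount * totalCount).toLong)
      }
    sample.unpersist()
    estimate
  }

One thing I am not sure about: SizeEstimator measures the deserialized
in-memory size of JVM objects, which can be quite different from the
serialized size that would actually be written to disk.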


My questions:
1. Can I use the above approach to solve the problem? If not, what is wrong
   with it?
2. Is there an existing solution (an existing API in Spark) for this problem?

Best Regards
Kelly Zhang
