fanyue-xia commented on PR #45971:
URL: https://github.com/apache/spark/pull/45971#issuecomment-2052224756

   > Thanks for the effort! This really requires some deep understanding of Spark internals...
   > 
   > There is still one important concern: the golden file size is too big. I looked a bit, and it seems the largest golden file is ~7MB. We should find a way to limit the file size to < 10MB.
   > 
   > One improvement I can see is that here you are storing both the rows and the partition ids, but we don't need to store the rows.
   > 
   > Instead, we store the random seed and regenerate the random rows in the check. By doing this we only need to store the seed, the schemas, and for each schema:
   > 
   > 1. partition ids, and
   > 2. numRows
   > 
   > Now the golden file size should be much smaller.
   > 
   > This means that we trust `RandomDataGenerator` to generate the same row for the same seed every time. The code hasn't been touched in about 10 years, so I think this should be safe. Although unlikely, when people really need to touch it later, they'll notice this test failure.
   
   I'm concerned that the seed might behave differently across runs or on different machines. I talked to @HeartSaVioR about it, and he mentioned that he isn't sure whether that's the case; it's safer to just store the generated inputs, since any difference in the random generation would be hard to track down.
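
   For illustration only, here is a minimal sketch of the seed-based approach being discussed. It does not use the real `RandomDataGenerator` or the partitioner under test; `generateRows` and `partitionIdFor` are hypothetical stand-ins. The point is just that the golden file would only need to hold the seed, `numRows`, and the expected partition ids:

   ```scala
   import scala.util.Random

   object SeedBasedGoldenCheck {
     // Hypothetical stand-in for RandomDataGenerator: rows are fully determined
     // by the seed, so they never need to be written into the golden file.
     def generateRows(seed: Long, numRows: Int, numCols: Int): Seq[Seq[Any]] = {
       val rand = new Random(seed)
       Seq.fill(numRows)(Seq.fill(numCols)(rand.nextInt()))
     }

     // Hypothetical stand-in for the hash partitioner under test.
     def partitionIdFor(row: Seq[Any], numPartitions: Int): Int = {
       val h = row.hashCode()
       ((h % numPartitions) + numPartitions) % numPartitions // non-negative mod
     }

     // The check regenerates the rows from the stored seed and compares the
     // recomputed partition ids against the ids stored in the golden file.
     def check(seed: Long, numRows: Int, numCols: Int, numPartitions: Int,
               goldenPartitionIds: Seq[Int]): Unit = {
       val rows = generateRows(seed, numRows, numCols)
       val actual = rows.map(partitionIdFor(_, numPartitions))
       assert(actual == goldenPartitionIds,
         s"partition ids diverged from the golden file for seed=$seed")
     }
   }
   ```

   Whether this is safe hinges exactly on the determinism question above: the same seed must reproduce the same rows across JVMs, Scala versions, and machines.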

