fanyue-xia commented on PR #45971: URL: https://github.com/apache/spark/pull/45971#issuecomment-2052224756
> Thanks for the effort! This really requires some deep understanding of spark internals...
>
> There is still one important concern, that the golden file size is too big. I looked a bit, it seems that the largest golden file is ~7MB. We should find a way to limit the file size to < 10MB.
>
> One improvement I can see is that, here you are storing both the rows and partition ids, but we don't need to store rows.
>
> Instead, we store the random seed, and regenerate the random rows in the check. By doing this we only need to store the seed, the schemas, and for each schema:
>
> 1. partition ids, and
> 2. numRows
>
> Now golden file size should be much smaller.
>
> This means that we trust `RandomDataGenerator` to generate the same row for the same seed everytime. The code hasn't been touched like 10 years so I think this should be safe. Although unlikely, when people really need to touch that later, they'll notice this test failure.

I am concerned that the same seed might produce different rows across runs or on different machines. I talked to @HeartSaVioR about it; he isn't sure whether that can happen, and thinks it is safer to just store the generated inputs. If random generation differs anywhere, the cause will be hard to track down.
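For reference, here is a minimal sketch of the seed-based check proposed in the quoted review, written in plain Java rather than against Spark. Everything in it is hypothetical: `SeededGoldenCheck`, `generateRows`, `partitionIds`, and `matchesGolden` are illustrative names, the rows are plain ints, and the partitioner is a stand-in rather than Spark's hash function. It only relies on the documented `java.util.Random` contract that the same seed and the same sequence of calls yield the same numbers. It says nothing about whether `RandomDataGenerator` is equally stable.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch: a golden entry keeps seed, numRows and partition ids, not the rows. */
final class SeededGoldenCheck {

    /** Regenerates the input rows (plain ints here) from the seed alone. */
    static int[] generateRows(long seed, int numRows) {
        // Same seed and same call sequence give the same values (java.util.Random contract).
        Random rng = new Random(seed);
        int[] rows = new int[numRows];
        for (int i = 0; i < numRows; i++) {
            rows[i] = rng.nextInt();
        }
        return rows;
    }

    /** Stand-in partitioner (NOT Spark's hash function): non-negative modulo of the hash code. */
    static int[] partitionIds(int[] rows, int numPartitions) {
        int[] ids = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            ids[i] = Math.floorMod(Integer.hashCode(rows[i]), numPartitions);
        }
        return ids;
    }

    /** The check: regenerate rows from the stored seed and compare with the stored ids. */
    static boolean matchesGolden(long seed, int numRows, int numPartitions, int[] goldenIds) {
        int[] actual = partitionIds(generateRows(seed, numRows), numPartitions);
        return Arrays.equals(actual, goldenIds);
    }

    public static void main(String[] args) {
        long seed = 42L;
        int numRows = 5;
        int numPartitions = 8;
        // In the proposal, only these ids (plus seed, schema and numRows) would be persisted.
        int[] golden = partitionIds(generateRows(seed, numRows), numPartitions);
        System.out.println(matchesGolden(seed, numRows, numPartitions, golden));
    }
}
```

The trade-off is the one raised above: if row generation ever changes, this check fails without showing which input rows differed, whereas storing the generated inputs makes the golden file larger but keeps the check independent of the generator.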