HeartSaVioR commented on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-635882784
I've spent some time experimenting with more approaches. This is the experiment branch: https://github.com/HeartSaVioR/spark/tree/SPARK-30946-experiments

> Version 3 only applies compression (LZ4) to the existing format. See the commit below: https://github.com/HeartSaVioR/spark/commit/406670aa4910c4bec847a590ef37f2f0bd130902

> Version 4 serializes/deserializes each entry via DataInputStream / DataOutputStream: https://github.com/HeartSaVioR/spark/commit/7c665163bdd1930fb8812d46a7f8cdd599b1cafb

I've also implemented simple apps to 1) prepare metadata (so that we can experiment on a specific batch) and 2) run a simple test with the various versions: https://github.com/HeartSaVioR/spark-delegation-token-experiment/commit/bea7680e4c588f455f8c3181a96c9eff5002fa1a

The numbers are recorded below: https://docs.google.com/spreadsheets/d/1D5P103F_sKOjkDpNr9PaCC8Ehk4Y4dRtH3oEdytM4_c/edit?usp=sharing

version | elapsed time | elapsed time (ratio of v1) | size | size (ratio of v1)
------- | ------------ | -------------------------- | ---- | ------------------
1 | 10628.75 | 100.00% | 57265744 | 100.00%
2 | 939.25 | 8.84% | 16655736 | 29.08%
3 | 10116 | 95.18% | 17259852 | 30.14%
4 | 837 | 7.87% | 15285626 | 26.69%

The numbers show that applying compression to the existing format (version 3) barely reduces the elapsed time (95.18% of v1), although it reduces the size about as much as the other alternatives do. The alternatives that are integrated directly into the data structure (versions 2 and 4) greatly reduce the time, to under 9% of v1, which is more than 10 times faster. The size of the compact files is similar across the alternatives (roughly 27% to 30% of v1).
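To make the Version 4 idea concrete, here is a minimal Java sketch of writing and reading a batch of entries through DataOutputStream / DataInputStream. It is not the code from the linked commit: the `Entry` class, its fields, and the count-prefixed layout are assumptions made purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the implementation from the linked commit.
final class EntryCodecSketch {

    // Hypothetical log entry. The field set is an assumption for this sketch.
    static final class Entry {
        final String path;
        final long size;
        final long modificationTime;
        final String action;

        Entry(String path, long size, long modificationTime, String action) {
            this.path = path;
            this.size = size;
            this.modificationTime = modificationTime;
            this.action = action;
        }
    }

    // Writes an entry count followed by each entry's fields in a fixed order.
    static byte[] serialize(List<Entry> entries) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(entries.size());
            for (Entry e : entries) {
                out.writeUTF(e.path);
                out.writeLong(e.size);
                out.writeLong(e.modificationTime);
                out.writeUTF(e.action);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Reads the fields back in exactly the order serialize() wrote them.
    static List<Entry> deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int count = in.readInt();
            List<Entry> entries = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                String path = in.readUTF();
                long size = in.readLong();
                long modificationTime = in.readLong();
                String action = in.readUTF();
                entries.add(new Entry(path, size, modificationTime, action));
            }
            return entries;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("/out/part-00000", 1024L, 1590000000000L, "add"),
            new Entry("/out/part-00001", 2048L, 1590000000001L, "add"));
        byte[] bytes = serialize(entries);
        List<Entry> restored = deserialize(bytes);
        System.out.println(restored.size() + " entries in " + bytes.length + " bytes");
    }
}
```

A fixed binary layout like this avoids parsing text for every entry on read, which would be consistent with the elapsed-time gap between versions 3 and 4 in the table, though the sketch itself measures nothing.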
