HeartSaVioR opened a new pull request #27694: [SPARK-30946][SS] Serde entry with UnsafeRow on FileStream(Source/Sink)Log with LZ4 compression
URL: https://github.com/apache/spark/pull/27694

### What changes were proposed in this pull request?

This patch uses UnsafeRow to serialize/deserialize entries in FileStream(Source/Sink)Log, and applies LZ4 compression to the serde. The underlying idea is that both FileEntry and SinkFileStatus haven't changed for three years, yet we have kept the JSON serde to accommodate possible changes and have been paying a considerable overhead for the JSON format.

CompactibleFileStreamLog carries metadata version information, but it has only ever been version 1. This patch introduces metadata version 2 of CompactibleFileStreamLog, which leverages UnsafeRow and LZ4 compression.

While we introduce version 2, CompactibleFileStreamLog remains compatible with version 1 for both serialization and deserialization: we can read from version 2 and write to version 2 in the normal case, and we can also read from version 1 and write to version 2 for smooth migration. (We could even stay on version 1 - read from version 1 and write to version 1 - though we won't actually do that.) A rough sketch of this version-dispatching read path is included at the end of this description.

The approach for loading and storing UnsafeRow in a file is borrowed from HDFSBackedStateStoreProvider, as it has been running throughout the 2.x line and no critical issue has been reported so far. A rough sketch of that framing is also included at the end of this description.

### Why are the changes needed?

Multiple JIRA issues have been filed to report that huge metadata logs make queries very slow. While this patch won't stop the metadata log from growing, it can substantially reduce the size of the metadata log and the time spent on compaction.

* [SPARK-24295](https://issues.apache.org/jira/browse/SPARK-24295)
* [SPARK-29995](https://issues.apache.org/jira/browse/SPARK-29995)

Please find the numbers in the section "How was this patch tested?".

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

* Existing UTs (the UTs checking 2.1.0 compatibility are effectively checking V1 compatibility, as I don't see any change there)
* Manually ran a simple perf. test

> Test environment

* target metadata: FileStreamSinkLog
* compact interval = 10
* prepared 1 compact file and 8 normal files, then wrote the next batch, which is a compact batch
* number of total entries (in compact batch): 1,707,950
* memory usage of total entries (in compact batch): 606.97 MB

> V1

* total size of metadata log file (compact): 391.12 MB (410,116,683 bytes)
* total elapsed time to compact (avg. of 10 runs, excluding min and max): 26,199 ms
* total elapsed time to load all entries to compact (avg. of 10 runs, excluding min and max): 13,558 ms
* total elapsed time to write all entries to the compact batch file (avg. of 10 runs, excluding min and max): 12,666 ms

> V2

To run the test in the same environment, I converted the input files to V2 before running the tests.

* total size of metadata log file (compact): 76.36 MB (80,073,732 bytes) (19.52% of V1)
* total elapsed time to compact (avg. of 10 runs, excluding min and max): 2,948 ms (11.25% of V1)
* total elapsed time to load all entries to compact (avg. of 10 runs, excluding min and max): 1,023 ms (7.54% of V1)
* total elapsed time to write all entries to the compact batch file (avg. of 10 runs, excluding min and max): 1,956 ms (15.44% of V1)
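The following is a minimal sketch, not the PR's actual code (class and method names such as `UnsafeRowLogSerde`, `writeEntries`, and `readEntries` are illustrative), of the size-prefixed UnsafeRow framing with LZ4 block compression, mirroring the load/store pattern HDFSBackedStateStoreProvider uses for its delta/snapshot files:

```scala
import java.io.{DataInputStream, DataOutputStream, InputStream, OutputStream}

import net.jpountz.lz4.{LZ4BlockInputStream, LZ4BlockOutputStream}
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

object UnsafeRowLogSerde {

  /** Writes each entry row as [length][row bytes], terminated by a -1 marker. */
  def writeEntries(rows: Iterator[UnsafeRow], rawOut: OutputStream): Unit = {
    val out = new DataOutputStream(new LZ4BlockOutputStream(rawOut))
    try {
      rows.foreach { row =>
        val bytes = row.getBytes
        out.writeInt(bytes.length)
        out.write(bytes)
      }
      out.writeInt(-1) // end-of-stream marker
    } finally {
      out.close()
    }
  }

  /** Reads entries back, assuming a fixed number of fields per row. */
  def readEntries(rawIn: InputStream, numFields: Int): Iterator[UnsafeRow] = {
    val in = new DataInputStream(new LZ4BlockInputStream(rawIn))
    new Iterator[UnsafeRow] {
      private var nextLength = in.readInt()
      override def hasNext: Boolean = nextLength >= 0
      override def next(): UnsafeRow = {
        val bytes = new Array[Byte](nextLength)
        in.readFully(bytes)
        val row = new UnsafeRow(numFields)
        row.pointTo(bytes, bytes.length)
        nextLength = in.readInt()
        if (nextLength < 0) in.close()
        row
      }
    }
  }
}
```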
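And a sketch of how the read path can stay compatible with version 1, assuming the log file starts with a version marker line (e.g. "v1" or "v2") followed by the payload; the helper names and the exact marker layout are assumptions for illustration, not the PR's API:

```scala
import java.io.{BufferedInputStream, InputStream}

object VersionedLogReader {

  /** Reads bytes up to the first '\n' as the version marker, e.g. "v1" or "v2". */
  private def readVersionLine(in: InputStream): String = {
    val buf = new StringBuilder
    var b = in.read()
    while (b != -1 && b != '\n') {
      buf.append(b.toChar)
      b = in.read()
    }
    buf.toString
  }

  /** Dispatches to the legacy JSON reader or the UnsafeRow+LZ4 reader by version. */
  def deserialize[T](in: InputStream,
                     readV1: InputStream => Array[T],
                     readV2: InputStream => Array[T]): Array[T] = {
    val buffered = new BufferedInputStream(in)
    readVersionLine(buffered) match {
      case "v1"  => readV1(buffered) // legacy JSON-per-line entries
      case "v2"  => readV2(buffered) // LZ4-compressed UnsafeRow entries
      case other => throw new IllegalStateException(s"Unsupported log version: $other")
    }
  }
}
```

Writing always targets version 2 under this scheme, so a version 1 log is upgraded the first time it is compacted or rewritten.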
