HeartSaVioR opened a new pull request #27694: [SPARK-30946][SS] Serde entry 
with UnsafeRow on FileStream(Source/Sink)Log with LZ4 compression
URL: https://github.com/apache/spark/pull/27694
 
 
   ### What changes were proposed in this pull request?
   
   This patch uses UnsafeRow to serialize/deserialize entry on 
FileStream(Source/Sink)Log, as well as applying LZ4 compression on serde.
   
   The ground idea is that both FileEntry and SinkFileStatus haven't been 
changed for 3 years but we have been simply taking JSON serde to prepare with 
possible change and spending bunch of overhead on JSON format. 
CompactibleFileStreamLog has the metadata version information but it has been 
just version 1.
   
   This patch introduces metadata version 2 of CompactibleFileStreamLog as 
leveraging UnsafeRow & LZ4 compression. While we introduce version 2 for 
CompactibleFileStreamLog, CompactibleFileStreamLog is still compatible with 
version 1 for both serialization and deserialization, so we can read from 
version 2 and write to version 2 in normal, and also we can read from version 1 
and write to version 2 for smooth migration.
   (Even we can stay with version 1 - read from version 1 and write to version 
1 - whereas we won't do it actually.)
   
   The approach how to load and store UnsafeRow with file is borrowed from 
HDFSBackedStateStoreProvider, as it has been running for 2.x version lines and 
no critical issue has been reported so far.
   
   ### Why are the changes needed?
   
   Multiple JIRA issues have been filed to report that huge metadata logs make 
their queries very slow. While this patch won't make their metadata log stop 
growing, this patch may heavily reduce down their metadata log and time to 
process compaction.
   
   * [SPARK-24295](https://issues.apache.org/jira/browse/SPARK-24295)
   * [SPARK-29995](https://issues.apache.org/jira/browse/SPARK-29995)
   
   Please find the numbers in the section "How was this patch tested?".
   
   ### Does this PR introduce any user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   * Existing UTs (UTs for checking with 2.1.0 compatibility is basically the 
checking with V1 compatibility, as I don't find any change)
   * Manually done simple perf. test
   
   > Test environment
   
   * target metadata: FileStreamSinkLog
   * compact interval = 10
   * prepared 1 compact file and 8 normal file, try to write the next batch 
which is compact batch
   * the number of total entries (in compact batch): 1,707,950
   * memory usage of total entries (in compact batch): 606,97 MB
   
   > V1
   
   * total size of metadata log file (compact): 391.12 MB (410,116,683 bytes)
   * total elapsed time to compact (avg. of 10 times, excluding min and max): 
26,199 ms
   * total elapsed time to load all entities to compact (avg. of 10 times, 
excluding min and max): 13,558 ms
   * total elapsed time to write all entities to compact batch file (avg. of 10 
times, excluding min and max): 12,666 ms
   
   > V2
   
   To run the test in same environment, I converted input files to V2 before 
running tests.
   
   * total size of metadata log file (compact): 76.36 MB (80,073,732 bytes) 
(19.52%)
   * total elapsed time to compact(avg. of 10 times, excluding min and max): 
2,948 ms (11.25%)
   * total elapsed time to load all entities to compact (avg. of 10 times, 
excluding min and max): 1,023 ms (7.54%)
   * total elapsed time to write all entities to compact batch file (avg. of 10 
times, excluding min and max): 1,956 ms (15.44%)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to