[ https://issues.apache.org/jira/browse/SPARK-22805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292848#comment-16292848 ]
Sergei Lebedev edited comment on SPARK-22805 at 12/15/17 5:25 PM:
------------------------------------------------------------------

Here are the results for a single application with 6K partitions. Admittedly, this is not generalizable to every application, but it gives an idea of the redundancy due to {{StorageLevel}}:

||Mode||Size||
|LZ4-compressed|8.1G|
|Decompressed|79G|
|LZ4-compressed with patch|7.2G|
|Decompressed with patch|49G|

> Use aliases for StorageLevel in event logs
> ------------------------------------------
>
>                 Key: SPARK-22805
>                 URL: https://issues.apache.org/jira/browse/SPARK-22805
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.2, 2.2.1
>            Reporter: Sergei Lebedev
>            Priority: Minor
>
> Fact 1: {{StorageLevel}} has a private constructor, therefore the list of
> predefined levels is not extendable (by the users).
> Fact 2: The format of event logs uses a redundant representation for storage
> levels:
> {code}
> >>> len('{"Use Disk": true, "Use Memory": false, "Deserialized": true, "Replication": 1}')
> 79
> >>> len('DISK_ONLY')
> 9
> {code}
> Fact 3: This leads to excessive log sizes for workloads with lots of
> partitions, because every partition carries a storage-level field that is
> 60-70 bytes larger than it needs to be.
> Suggested quick win: use the names of the predefined levels to identify them
> in the event log.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
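The suggested quick win can be sketched in a few lines of Python. This is an illustrative sketch, not Spark's implementation: the alias table below is a hypothetical subset keyed on the four JSON fields shown in the issue (the real predefined levels live in org.apache.spark.storage.StorageLevel and include off-heap and replicated variants):

```python
import json

# Hypothetical subset of Spark's predefined StorageLevel aliases, keyed on
# ("Use Disk", "Use Memory", "Deserialized", "Replication").
ALIASES = {
    (True, False, True, 1): "DISK_ONLY",
    (False, True, True, 1): "MEMORY_ONLY",
    (True, True, True, 1): "MEMORY_AND_DISK",
}

def compact(level):
    """Replace a verbose StorageLevel dict with its alias when one matches."""
    key = (level["Use Disk"], level["Use Memory"],
           level["Deserialized"], level["Replication"])
    # User-defined levels have no alias, so fall back to the full object.
    return ALIASES.get(key, level)

verbose = {"Use Disk": True, "Use Memory": False,
           "Deserialized": True, "Replication": 1}
print(len(json.dumps(verbose)))           # 79 bytes per partition
print(len(json.dumps(compact(verbose))))  # 11 bytes ("DISK_ONLY", quoted)
```

Keeping the fallback to the full object preserves round-tripping for user-defined levels, which addresses Fact 1: only the predefined, non-extendable levels are replaced by their names.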