[ https://issues.apache.org/jira/browse/SPARK-32274 ]
Wenchen Fan reassigned SPARK-32274:
-----------------------------------

    Assignee: Robert Joseph Evans

> Add in the ability for a user to replace the serialization format of the cache
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-32274
>                 URL: https://issues.apache.org/jira/browse/SPARK-32274
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Major
>
> Caching a dataset or dataframe can be a very expensive operation, but it has a huge benefit for later queries that use it. There are many use cases that would benefit from caching the data, but not enough to justify the cost under the current scheme. I would like to propose that we make the serialization of the cache pluggable, so that users can explore other formats and compression codecs (see the sketch at the end of this description).
>
> As an example, I took the lineitem table from TPC-H at a scale factor of 10 and converted it to parquet. This resulted in 2.1 GB of data on disk. With the current caching it can take nearly 8 GB to store that same data in memory, and about 5 GB to store it on disk.
>
> If I want to read all of that data and write it out again:
> {code:java}
> scala> val a = spark.read.parquet("../data/tpch/SF10_parquet/lineitem.tbl/")
> a: org.apache.spark.sql.DataFrame = [l_orderkey: bigint, l_partkey: bigint ... 14 more fields]
>
> scala> spark.time(a.write.mode("overwrite").parquet("./target/tmp"))
> Time taken: 25832 ms {code}
> But a query that reads that data directly from the cache, after it is built, takes only 21531 ms. For some queries, being able to fit much more data in the cache might be worth the extra query time.
>
> It also takes a lot less time to do the parquet compression than it does to do the cache compression.
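>
> To make the proposal concrete, here is a minimal sketch of what such a plugin interface could look like. Every name in it (CachedBatch, CachedBatchSerializer, the method signatures) is an illustrative assumption for discussion, not an existing Spark API.
> {code:scala}
> // Sketch of a pluggable cache serializer. All names are illustrative.
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.types.StructType
>
> /** An opaque, serializer-defined representation of a batch of cached rows. */
> trait CachedBatch {
>   def numRows: Int
>   def sizeInBytes: Long
> }
>
> /**
>  * The plugin point: an implementation decides how cached data is encoded,
>  * e.g. the current columnar format, Parquet pages, or some other codec.
>  */
> trait CachedBatchSerializer extends Serializable {
>   /** Encode a partition of rows into the cached representation. */
>   def convertForCache(input: RDD[InternalRow], schema: StructType): RDD[CachedBatch]
>
>   /** Decode cached batches back into rows when a query reads the cache. */
>   def convertFromCache(input: RDD[CachedBatch], schema: StructType): RDD[InternalRow]
> }
> {code}
> A user would then select an implementation through a config, e.g. something like spark.sql.cache.serializer=com.example.ParquetCachedBatchSerializer (both the key and the class here are hypothetical).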