[ https://issues.apache.org/jira/browse/SPARK-32274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-32274:
------------------------------------
Assignee: (was: Apache Spark)
> Add in the ability for a user to replace the serialization format of the cache
> ------------------------------------------------------------------------------
>
> Key: SPARK-32274
> URL: https://issues.apache.org/jira/browse/SPARK-32274
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Robert Joseph Evans
> Priority: Major
>
> Caching a dataset or dataframe can be a very expensive operation, but has a
> huge benefit for later queries that use it. There are many use cases that
> could benefit from caching the data, but not by enough to justify the cost
> of the current scheme. I would like to propose that we make the
> serialization of the cache pluggable. That way users can explore other
> formats and compression codecs.
>
> As an example I took the line item table from TPCH at a scale factor of 10
> and converted it to parquet. This resulted in 2.1 GB of data on disk. With
> the current caching it can take nearly 8 GB to store that same data in
> memory, and about 5 GB to store it on disk.
>
> Suppose I want to read all of that data and write it out again:
> {code:java}
> scala> val a = spark.read.parquet("../data/tpch/SF10_parquet/lineitem.tbl/")
> a: org.apache.spark.sql.DataFrame = [l_orderkey: bigint, l_partkey: bigint
> ... 14 more fields]
> scala> spark.time(a.write.mode("overwrite").parquet("./target/tmp"))
> Time taken: 25832 ms {code}
> But a query that reads that data directly from the cache after it is built
> only takes 21531 ms. For some queries, being able to store much more data
> in the cache might be worth the extra query time.
>
> It also takes a lot less time to do the parquet compression than it
> does to do the cache compression.