[
https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenchen Fan reassigned SPARK-30616:
-----------------------------------
Assignee: Yaroslav Tkachenko
> Introduce TTL config option for SQL Metadata Cache
> --------------------------------------------------
>
> Key: SPARK-30616
> URL: https://issues.apache.org/jira/browse/SPARK-30616
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yaroslav Tkachenko
> Assignee: Yaroslav Tkachenko
> Priority: Major
> Fix For: 3.1.0
>
>
> From
> [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive
> metastore Parquet table conversion is enabled, metadata of those converted
> tables are also cached. If these tables are updated by Hive or other external
> tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Currently Spark [caches file listing for
> tables|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]
> and requires issuing {{REFRESH TABLE}} any time the file listing has
> changed outside of Spark. Issuing {{REFRESH TABLE}} commands by hand
> quickly becomes cumbersome: with frequently added files,
> hundreds of tables and dozens of users querying the data (and expecting
> up-to-date results), manually refreshing metadata for each table is not a
> viable solution.
> This is a pretty common use-case for streaming ingestion of data, which can
> be done outside of Spark (with tools like Kafka Connect, etc.).
> Presto has a similar feature, {{hive.file-status-cache-expire-time}},
> documented
> [here|https://prestosql.io/docs/current/connector/hive.html#hive-configuration-properties].
> I propose to introduce a new option in Spark (something like
> {{spark.sql.hive.filesourcePartitionFileCacheTTL}}) that controls the TTL of
> this metadata cache. It can be disabled by default (-1), so it doesn't change
> the existing behaviour.
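
The proposed semantics can be illustrated with a small, self-contained sketch. This is not Spark's implementation: the class name `FileListingCache`, its `get` method, and the injectable `clock` are hypothetical and exist only to show how a TTL, with -1 meaning "never expire", would interact with a cached file listing.

```python
import time
from typing import Callable, Dict, List, Tuple


class FileListingCache:
    """Toy cache of path -> file listing with an optional time-to-live.

    Hypothetical illustration only; names and structure are assumptions,
    not taken from Spark's code base.
    """

    def __init__(
        self,
        ttl_seconds: float = -1,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # A negative TTL disables expiry, mirroring the proposed default of -1.
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, path: str, list_files: Callable[[str], List[str]]) -> List[str]:
        """Return the cached listing, re-listing the path once the TTL has elapsed."""
        now = self.clock()
        entry = self._entries.get(path)
        if entry is not None:
            loaded_at, files = entry
            if self.ttl_seconds < 0 or now - loaded_at < self.ttl_seconds:
                return files
        # Cache miss or expired entry: list again and remember when we did.
        files = list_files(path)
        self._entries[path] = (now, files)
        return files
```

With a non-negative TTL, files added outside the process become visible at most `ttl_seconds` after the previous listing; with the default of -1, an entry stays cached until it is refreshed explicitly, which matches the existing behaviour described above.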