[ 
https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30616:
-----------------------------------

    Assignee: Yaroslav Tkachenko

> Introduce TTL config option for SQL Metadata Cache
> --------------------------------------------------
>
>                 Key: SPARK-30616
>                 URL: https://issues.apache.org/jira/browse/SPARK-30616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yaroslav Tkachenko
>            Assignee: Yaroslav Tkachenko
>            Priority: Major
>             Fix For: 3.1.0
>
>
> From the 
> [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive 
> metastore Parquet table conversion is enabled, metadata of those converted 
> tables are also cached. If these tables are updated by Hive or other external 
> tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Currently, Spark [caches the file listing for 
> tables|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]
>  and requires issuing {{REFRESH TABLE}} whenever the file listing changes 
> outside of Spark. Unfortunately, submitting {{REFRESH TABLE}} commands by 
> hand quickly becomes cumbersome: with frequently added files, hundreds of 
> tables, and dozens of users querying the data (and expecting up-to-date 
> results), manually refreshing the metadata for each table is not a workable 
> solution.
> This is a common use case when data is ingested by streaming pipelines that 
> run outside of Spark (for example, with tools like Kafka Connect).
> A similar feature exists in Presto: the {{hive.file-status-cache-expire-time}} 
> option, documented 
> [here|https://prestosql.io/docs/current/connector/hive.html#hive-configuration-properties].
> I propose introducing a new option in Spark (something like 
> {{spark.sql.hive.filesourcePartitionFileCacheTTL}}) that controls the TTL of 
> this metadata cache. It can be disabled by default (-1), so the existing 
> behaviour does not change.
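
The proposed behaviour can be illustrated with a minimal sketch of a file-listing cache whose entries expire after a TTL. This is not Spark's implementation: the class name FileListingCache, the list_files callable, and the injectable clock are all hypothetical names chosen for illustration. The only details taken from the proposal are that a negative TTL (the suggested default of -1) disables expiry, and that an explicit refresh remains available.

```python
import time


class FileListingCache:
    """Hypothetical sketch: caches per-table file listings with an optional TTL.

    A negative ttl_seconds disables expiry, so entries stay cached until
    refresh() is called (the current "manual REFRESH TABLE" behaviour).
    """

    def __init__(self, list_files, ttl_seconds=-1, clock=time.monotonic):
        self._list_files = list_files  # callable: table name -> iterable of paths
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries = {}             # table name -> (loaded_at, [paths])

    def get(self, table):
        now = self._clock()
        entry = self._entries.get(table)
        if entry is not None:
            loaded_at, files = entry
            # Serve from cache if expiry is disabled or the entry is still fresh.
            if self._ttl < 0 or now - loaded_at < self._ttl:
                return files
        # Missing or expired: list the files again and store a snapshot.
        files = list(self._list_files(table))
        self._entries[table] = (now, files)
        return files

    def refresh(self, table):
        """Manual invalidation, analogous in spirit to REFRESH TABLE."""
        self._entries.pop(table, None)


if __name__ == "__main__":
    storage = {"events": ["part-0.parquet"]}
    cache = FileListingCache(lambda t: storage[t], ttl_seconds=60)
    print(cache.get("events"))
```

With a positive TTL, a file added by an external writer becomes visible to readers at most ttl_seconds after the previous listing, without anyone issuing a refresh. The trade-off is that listings are repeated periodically even for tables that have not changed.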



