[ 
https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaroslav Tkachenko updated SPARK-30616:
---------------------------------------
    Description: 
From the 
[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Currently Spark [caches the file listing for 
tables|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]
 and requires issuing {{REFRESH TABLE}} any time the file listing changes 
outside of Spark. Unfortunately, submitting {{REFRESH TABLE}} commands manually 
can be very cumbersome: with frequently added files, hundreds of tables, and 
dozens of users querying the data (and expecting up-to-date results), 
refreshing the metadata for each table by hand is not a workable solution.
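
A minimal sketch of the current manual workaround, assuming a hypothetical 
Hive-backed table named {{events}}:
{code:scala}
// After new Parquet files land outside of Spark (e.g. via Kafka Connect),
// each Spark application that already cached the file listing has to refresh it:
spark.sql("REFRESH TABLE events")

// Equivalent call through the catalog API:
spark.catalog.refreshTable("events")
{code}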

This is a pretty common use case for streaming ingestion of data, which can be 
done outside of Spark (with tools like Kafka Connect, etc.).

A similar feature exists in Presto: the {{hive.file-status-cache-expire-time}} 
property, documented 
[here|https://prestosql.io/docs/current/connector/hive.html#hive-configuration-properties].
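
For comparison, a sketch of the corresponding Presto Hive catalog setting (the 
value is illustrative, not a recommendation):
{code}
# etc/catalog/hive.properties
hive.file-status-cache-expire-time=1m
{code}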

I propose to introduce a new option in Spark (something like 
{{spark.sql.hive.filesourcePartitionFileCacheTTL}}) that controls the TTL of 
this metadata cache. It could be disabled by default (-1), so it doesn't change 
the existing behaviour.
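
A sketch of how the proposed option might be used. The option name comes from 
this proposal and does not exist in Spark today; the TTL value is illustrative:
{code:scala}
// Hypothetical: expire cached file listings after 5 minutes, so queries pick up
// externally added files without anyone having to issue REFRESH TABLE.
spark.conf.set("spark.sql.hive.filesourcePartitionFileCacheTTL", "5m")

// A default of -1 would disable expiration and keep the existing behaviour.
{code}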

  was:
From the 
[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately, simply submitting "REFRESH TABLE" commands could be very 
cumbersome. Assuming frequently generated new Parquet files, hundreds of tables, 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing metadata for each table is not an optimal solution. And 
this is a pretty common use case for streaming ingestion of data.

I propose to introduce a new option in Spark (something like 
{{spark.sql.hive.filesourcePartitionFileCacheTTL}}) that controls the TTL of 
this metadata cache. It could be disabled by default (-1), so it doesn't change 
the existing behaviour.


> Introduce TTL config option for SQL Metadata Cache
> --------------------------------------------------
>
>                 Key: SPARK-30616
>                 URL: https://issues.apache.org/jira/browse/SPARK-30616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yaroslav Tkachenko
>            Assignee: Apache Spark
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
