[ https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138840#comment-17138840 ]
Apache Spark commented on SPARK-30616:
--------------------------------------

User 'sap1ens' has created a pull request for this issue:
https://github.com/apache/spark/pull/28852

> Introduce TTL config option for SQL Metadata Cache
> --------------------------------------------------
>
>                 Key: SPARK-30616
>                 URL: https://issues.apache.org/jira/browse/SPARK-30616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yaroslav Tkachenko
>            Priority: Major
>
> From [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive
> metastore Parquet table conversion is enabled, metadata of those converted
> tables are also cached. If these tables are updated by Hive or other external
> tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Unfortunately, manually submitting "REFRESH TABLE" commands can be very
> cumbersome. With frequently generated new Parquet files, hundreds of tables,
> and dozens of users querying the data (and expecting up-to-date results),
> refreshing metadata for each table by hand is not a practical solution. This
> is a common use case for streaming ingestion of data.
> I propose introducing a new option in Spark (something like
> "spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of
> this metadata cache. It can be disabled by default (-1), so it doesn't change
> the existing behaviour.
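A minimal sketch of how the proposed option might look in practice. The config key `spark.sql.hive.filesourcePartitionFileCacheTTL` is the placeholder name from the ticket description (the final name is not settled here), and the seconds-based unit is an assumption; -1 is the proposed "disabled" default that preserves today's behaviour:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical usage of the proposed TTL option. The key name and the
// seconds-based unit are assumptions from this ticket, not a released
// Spark API; -1 (the proposed default) would keep the current behaviour
// of caching metadata until an explicit REFRESH TABLE.
val spark = SparkSession.builder()
  .appName("metadata-cache-ttl-example")
  .enableHiveSupport()
  .config("spark.sql.hive.filesourcePartitionFileCacheTTL", "300")
  .getOrCreate()

// With a 300-second TTL, queries against a Hive-converted Parquet table
// would pick up newly written files at most 5 minutes after they land,
// without every user having to run the manual workaround:
spark.sql("REFRESH TABLE events")
{code}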