Matei Zaharia created SPARK-3091:
------------------------------------

             Summary: Add support for caching metadata on Parquet files
                 Key: SPARK-3091
                 URL: https://issues.apache.org/jira/browse/SPARK-3091
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Matei Zaharia
            Assignee: Matei Zaharia

For larger Parquet files, reading the file footers (done in parallel on up to 5 threads) and fetching the HDFS block locations (done serially) can take multiple seconds. ParquetInputFormat does cache footers, but only within each instance of ParquetInputFormat, not across instances. We can add an option to cache this data within FilteringParquetInputFormat.
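
To make the proposal concrete, here is a minimal sketch of a cache that is shared across input format instances rather than owned by one. It is written in Python for brevity and is not Spark or Parquet code: FileKey, MetadataCache, SketchInputFormat and slow_read_footer are hypothetical names, and invalidating on modification time and length is an assumption about how stale entries might be detected.

```python
import threading
from typing import Callable, Dict, Generic, NamedTuple, Tuple, TypeVar

M = TypeVar("M")


class FileKey(NamedTuple):
    """Identifies one version of a file (hypothetical; not a Parquet type)."""
    path: str
    modification_time: int
    length: int


class MetadataCache(Generic[M]):
    """Process-wide cache of per-file metadata, shared by all readers."""

    def __init__(self, loader: Callable[[str], M]) -> None:
        self._loader = loader
        # One entry per path; the stored key lets us detect a changed file.
        self._entries: Dict[str, Tuple[FileKey, M]] = {}
        self._lock = threading.Lock()
        self.loads = 0  # number of times the loader actually ran

    def get(self, key: FileKey) -> M:
        with self._lock:
            cached = self._entries.get(key.path)
            if cached is not None and cached[0] == key:
                return cached[1]
        # Load outside the lock so one slow read does not block other paths.
        value = self._loader(key.path)
        with self._lock:
            self._entries[key.path] = (key, value)
            self.loads += 1
        return value


def slow_read_footer(path: str) -> dict:
    """Stand-in for the expensive footer and block-location lookup."""
    return {"path": path, "row_groups": 3}


# Assumption: a single module-level cache outlives any one input format.
SHARED = MetadataCache(slow_read_footer)


class SketchInputFormat:
    """Hypothetical input format; every instance consults the shared cache."""

    def footer(self, key: FileKey) -> dict:
        return SHARED.get(key)
```

Two separate SketchInputFormat instances asking for the same FileKey trigger only one call to the loader, and a key with a newer modification time or a different length replaces the stale entry. A real implementation would also need a size bound or eviction policy, which this sketch leaves out.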